
Page 1:

Distributed Information Retrieval

Server Ranking for Distributed Text Retrieval Systems on the Internet

B. Yuwono and D. Lee

Siemens TREC-4 Report: Further Experiments with Database Merging

E. Voorhees

Brian Shaw

CS 5604

Page 2:

Issue: Merging for Effective Results

The architecture involves multiple brokers (which accept search queries) and multiple collection servers.

The broker must select the appropriate collection servers and merge their results.

Page 3:

Server Ranking: Overview

Problem: the “cost” of broadcasting the query to all servers, both in processing power and in the user’s time

Solution: the broker ranks the collection servers by a “goodness score,” broadcasts the query to at most σ (sigma) collection servers (σ is a preset number or a score threshold), and merges the results
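A rough Python sketch of this select-then-merge loop (all names here are illustrative placeholders, not from the papers; the concrete scoring and merging methods appear on the following slides):

```python
# Sketch of the broker's select-then-merge loop. The goodness scoring
# here is a trivial placeholder; the actual methods (CVV scoring,
# rank-based and die-based merging) are covered on later slides.

SIGMA = 2  # preset maximum number of collection servers to query

def broker(query, servers, goodness_score, sigma=SIGMA):
    # Rank all collection servers by their goodness score for this
    # query, then broadcast the query to at most `sigma` of them.
    ranked = sorted(servers, key=lambda s: goodness_score(query, s), reverse=True)
    selected = ranked[:sigma]
    # Each selected server returns a locally ranked result list; the
    # broker still has to merge these lists into one final ranking.
    return {s["name"]: s["search"](query) for s in selected}

# Toy usage with a placeholder goodness score (result-list length).
servers = [
    {"name": "news",  "search": lambda q: ["n1", "n2"]},
    {"name": "legal", "search": lambda q: ["l1"]},
    {"name": "web",   "search": lambda q: ["w1", "w2", "w3"]},
]
score = lambda q, s: len(s["search"](q))
print(broker("court ruling", servers, score))  # "web" and "news" are queried
```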

1- Server Ranking for Distributed Text Retrieval Systems on the Internet

Page 4:

Server Ranking: Server Selection

Relies solely on document frequency (DF) data; all collection servers must report DF changes to the broker.

The Cue Validity Variance (CVV) goodness score estimates how well a query term j distinguishes one collection server from another; it is not an indication of the quantity or quality of relevant documents.
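A minimal Python sketch of a CVV-style goodness computation, assuming each collection reports its size and its per-term document frequencies; the formulas follow the usual statement of the method, and all variable names and toy data are our own:

```python
# CVV-style goodness score. df[i][term] is the document frequency of
# `term` in collection i; sizes[i] is the number of documents in i.

def cue_validity(term, i, df, sizes):
    # Density of the term inside collection i ...
    inside = df[i].get(term, 0) / sizes[i]
    # ... versus its density in all other collections combined.
    other_df = sum(df[k].get(term, 0) for k in df if k != i)
    other_docs = sum(sizes[k] for k in sizes if k != i)
    outside = other_df / other_docs
    if inside + outside == 0:
        return 0.0
    return inside / (inside + outside)

def goodness(query_terms, df, sizes):
    collections = list(df)
    n = len(collections)
    scores = {i: 0.0 for i in collections}
    for term in query_terms:
        cv = {i: cue_validity(term, i, df, sizes) for i in collections}
        mean = sum(cv.values()) / n
        # CVV is the variance of cue validity across collections:
        # high when the term separates one collection from the rest.
        cvv = sum((cv[i] - mean) ** 2 for i in collections) / n
        for i in collections:
            scores[i] += cvv * df[i].get(term, 0)
    return scores

# Toy data: "kernel" is concentrated in the "os" collection.
df = {"os": {"kernel": 90, "law": 1}, "legal": {"kernel": 2, "law": 80}}
sizes = {"os": 100, "legal": 100}
print(goodness(["kernel"], df, sizes))  # "os" scores highest
```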

1- Server Ranking for Distributed Text Retrieval Systems on the Internet

Page 5:

Server Ranking: Merging

Assumption 1: the best document in collection i is as relevant as the best document in collection k.

Consequence: a collection server containing only a few, but highly relevant, documents will still contribute to the final list.

Assumption 2: the relevance distance between two consecutively ranked documents is inversely proportional to the collection’s goodness score.

Consequence: the number of documents a collection contributes to the final list is roughly proportional to its relative goodness score.

The final ranking is therefore a combination of the goodness scores and the local rankings (see the sketch below).
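One direct reading of the two assumptions above, as a Python sketch; the decay constant C is an illustrative tuning knob, not a value taken from the paper:

```python
# Merge locally ranked lists using goodness scores. Assumption 1: the
# top document of every collection starts at the same score (1.0).
# Assumption 2: scores decay per rank step at a rate inversely
# proportional to the collection's goodness score, so high-goodness
# collections decay slowly and place more documents near the top.

C = 0.1  # illustrative decay constant

def merge(result_lists, goodness):
    """result_lists: {collection: [doc ids in local rank order]}
    goodness: {collection: goodness score from server ranking}"""
    merged = []
    for coll, docs in result_lists.items():
        for rank, doc in enumerate(docs, start=1):
            score = 1.0 - (rank - 1) * C / goodness[coll]
            merged.append((score, coll, doc))
    merged.sort(reverse=True)
    return merged

results = {"os": ["d1", "d2", "d3"], "legal": ["e1", "e2"]}
# "legal" still places its best document at the very top (Assumption 1),
# but the high-goodness "os" fills more of the list (Assumption 2).
print(merge(results, {"os": 2.0, "legal": 0.5}))
```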

1- Server Ranking for Distributed Text Retrieval Systems on the Internet

Page 6:

Experiments: Overview

Problem: the broker has no access to metadata from the isolated collection servers.

Solution: choose collection server(s) based on results from previous training queries

2- Further Experiments with Database Merging

Page 7:

Experiments: Server Selection, Two Approaches

Query Clustering (QC): cluster the training queries (by the number of common documents they retrieve) and compute each cluster’s “centroid vector”; compare the query vector to the centroids and assign a weight to each collection (see the sketch below).

Modeling Relevant Document Distributions (MRDD): find the M training queries most similar to the query and assign weights to the collections based on the distribution of relevant documents in those training runs.
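A sketch of the MRDD idea in Python (QC follows the same shape, matching against cluster centroids instead of individual training queries); the training data, the cosine similarity measure, and the value of M are all illustrative assumptions here:

```python
# MRDD-style weighting: rank training queries by similarity to the new
# query, then sum the relevant-document distributions of the top M.

import math

def cosine(u, v):
    terms = set(u) | set(v)
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in terms)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mrdd_weights(query_vec, training, m=2):
    """training: list of (query vector, {collection: relevant doc count})
    Returns per-collection weights from the M most similar queries."""
    ranked = sorted(training, key=lambda t: cosine(query_vec, t[0]), reverse=True)
    weights = {}
    for _, dist in ranked[:m]:
        for coll, count in dist.items():
            weights[coll] = weights.get(coll, 0) + count
    return weights

training = [
    ({"court": 1, "ruling": 1}, {"legal": 8, "news": 2}),
    ({"appeal": 1, "court": 1}, {"legal": 6, "news": 4}),
    ({"kernel": 1, "patch": 1}, {"os": 9, "news": 1}),
]
print(mrdd_weights({"court": 1, "verdict": 1}, training))
# Legal-heavy weights, so most documents are drawn from "legal".
```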

2- Further Experiments with Database Merging

Page 8:

Experiments: Merging

N documents are retrieved in total, with the number taken from each server determined by its weight.

The final ranking is a random process: repeatedly roll a C-faced die that is biased by the number of documents still to be picked from each of the C collections (see the sketch below).
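A Python sketch of this biased-die interleaving; the per-collection quotas (how many documents each collection contributes) would come from the selection weights, and the values here are illustrative:

```python
# Biased-die merging: repeatedly roll a C-faced die weighted by the
# number of documents still owed from each collection, and take the
# next locally ranked document from the winning collection.

import random

def biased_die_merge(result_lists, quotas, rng=random):
    """result_lists: {collection: [doc ids in local rank order]}
    quotas: {collection: number of documents to draw from it}"""
    remaining = dict(quotas)
    cursors = {c: 0 for c in result_lists}
    merged = []
    while any(n > 0 for n in remaining.values()):
        colls = [c for c in remaining if remaining[c] > 0]
        pick = rng.choices(colls, weights=[remaining[c] for c in colls])[0]
        merged.append(result_lists[pick][cursors[pick]])
        cursors[pick] += 1
        remaining[pick] -= 1
    return merged

random.seed(0)
results = {"legal": ["l1", "l2", "l3", "l4"], "news": ["n1", "n2"]}
print(biased_die_merge(results, {"legal": 4, "news": 2}))
```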

2- Further Experiments with Database Merging

Page 9:

Comparison

                               1- Server Ranking                2- Experiments
Broker’s Knowledge             Shared document frequency data   Training query results
Collection Server Selection    CVV goodness scoring             Comparison to training queries
Merging                        Goodness score & local rank      Random (biased die)

Page 10:

Conclusions

The server ranking method proposed by Yuwono and Lee is an effective way to minimize operating costs (such as time) in an environment where brokers and collection servers can share document frequency data.

The “isolated merging strategies” proposed by Voorhees are an effective way to choose collection servers when no meta-information is shared between the broker and the collection servers.