Navigation Aided Retrieval
Shashank Pandit & Christopher Olston
Carnegie Mellon & Yahoo
Search & Navigation Trends
Users often search and then supplement the search by extensively navigating beyond the search page to locate relevant information.
Why?
Query formulation problems
Open-ended search tasks
Preference for orienteering
Search & Navigation Trends
User behaviour in IR tasks is often not fully exploited by search engines:
Content-based – words
PageRank – in-links and out-links for popularity
Collaborative – clicks on results
Search engines do not examine these navigation patterns (though the authors fail to mention SearchGuide – Coyle et al. – which does)
NAR – Navigation Aided Retrieval
A new retrieval paradigm that incorporates post-query user navigation as an explicit component – NAR
A query is seen as a means of identifying starting points for further navigation by the user
The starting points are presented to the user in a result list and permit easy navigation to many documents that match the user's query
NAR – Navigation Retrieval with Organic Structure
Structure naturally present in pre-existing web documents
Advantages:
Human oversight – human-generated categories etc.
Familiar user interface – list of documents (i.e. result-list)
Single view of document collection
Robust implementation – no semantic knowledge required
The model
D – set of documents in the corpus; T – user's search task
S_T – answer set for task T; Q_T – set of valid queries for task T
Query submodel – belief distribution for the answer set given a query: what is the likelihood that document d solves the task (relevance)
Navigation submodel – likelihood that a user starting at a particular document will be able to navigate (under guidance) to a document that solves the task.
Conventional probabilistic IR Model
No outward navigation considered
Probability of solving the task depends on whether there is a document in the document collection which solves the task
Probability of the document solving a task is based on its “relevance” to the query
Navigation-Conscious Model
Considers browsing as part of the search task
Query submodel – any probabilistic IR relevance ranking model
Navigation submodel – Stochastic model of user navigation WUFIS (Chi et al)
WUFIS
W(N, d1, d2) – probability that a user with need N will navigate from d1 to d2
Scent is provided by anchor and surrounding text
The probability of a link being followed is related to how well the user's need matches the scent – similarity between a weighted vector of need terms and scent terms
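The scent-matching idea above can be sketched as follows. This is an illustrative reading of WUFIS (Chi et al.), not the paper's exact implementation: term weights are raw counts, similarity is cosine, and the normalization over outgoing links is an assumption.

```python
# Sketch: link-following probability proportional to similarity between the
# user's need vector and each link's scent (anchor + surrounding text).
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two weighted term vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def follow_probabilities(need: Counter, links: dict) -> dict:
    """Map each outgoing link to P(follow | need), normalized over all links."""
    sims = {link: cosine(need, scent) for link, scent in links.items()}
    total = sum(sims.values())
    return {link: s / total for link, s in sims.items()} if total else sims

# Toy example (hypothetical pages and scent terms):
need = Counter({"jaguar": 2, "car": 1})
links = {
    "dealers.html": Counter({"jaguar": 1, "car": 2, "price": 1}),
    "wildlife.html": Counter({"jaguar": 1, "animal": 2}),
}
probs = follow_probabilities(need, links)
```

A user whose need mentions "car" is more likely to follow the dealers link than the wildlife one, which is exactly the scent effect the slide describes.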
Final Model
A document's starting-point score = Query submodel × Navigation submodel:

score_n(d, q) = Σ_{d' ∈ D} R(d', q) · W(N(d'), d, d')
Volant - Prototype
Volant - Preprocessing
Content Engine: R(d, q) – estimated by the Okapi BM25 scoring function
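For reference, a standard Okapi BM25 scorer looks roughly like this. The k1 and b values are conventional defaults, not taken from the paper, and the tokenized toy corpus is illustrative.

```python
# Sketch of Okapi BM25: relevance of a document to a query, combining
# term frequency, inverse document frequency, and length normalization.
from math import log

def bm25(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document (a list of tokens) against query_terms."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    n = len(corpus)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)            # document frequency
        idf = log((n - df + 0.5) / (df + 0.5) + 1)       # smoothed IDF
        tf = doc.count(t)                                # term frequency in doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["jaguar", "car", "price"], ["animal", "zoo", "cat"]]
relevant = bm25(["jaguar"], corpus[0], corpus)
```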
Connectivity Engine: estimates the probability of a user with need N(d2) navigating from d1 to d2, starting with first hop dw
Dijkstra's algorithm is used to generate tuples of the form (d1, d2, dw, W(N(d2), d1, d2))
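The Dijkstra precomputation can be sketched with the common trick of running shortest paths on -log(probability) edge weights, so that the shortest path corresponds to the most probable navigation path. The tuple layout (d1, d2, dw, W) follows the slide; the toy link graph and its probabilities are assumptions.

```python
# Sketch: for each target d2 reachable from d1, find the probability of the
# best navigation path and the first hop d_w taken along it.
import heapq
from math import log, exp

def best_paths(graph, source):
    """Return {d2: (best-path probability from source, first hop d_w)}."""
    dist = {source: 0.0}     # cost = -log(path probability)
    first = {source: None}   # first hop out of source on the best path
    pq = [(0.0, source, None)]
    while pq:
        cost, node, hop = heapq.heappop(pq)
        if cost > dist.get(node, float("inf")):
            continue                          # stale heap entry
        for nxt, p in graph.get(node, {}).items():
            ncost = cost - log(p)
            if ncost < dist.get(nxt, float("inf")):
                dist[nxt] = ncost
                nhop = nxt if node == source else hop
                first[nxt] = nhop
                heapq.heappush(pq, (ncost, nxt, nhop))
    return {d2: (exp(-c), first[d2]) for d2, c in dist.items() if d2 != source}

# Toy graph: edge values are WUFIS-style link-following probabilities.
graph = {"home": {"a": 0.6, "b": 0.4}, "a": {"goal": 0.5}, "b": {"goal": 0.9}}
tuples = [("home", d2, dw, w)   # (d1, d2, d_w, W(N(d2), d1, d2)) as on the slide
          for d2, (w, dw) in best_paths(graph, "home").items()]
```

Note that "goal" is best reached via "b" (0.4 × 0.9 = 0.36) rather than via "a" (0.6 × 0.5 = 0.30), so the stored first hop is "b".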
Volant – Starting points
Query entered -> ranked list of starting points
1. Retrieve from the content engine all documents d' that are relevant to the query
2. For each d' retrieved in step 1, retrieve from the connectivity engine all documents d for which W(N(d'), d, d') > 0
3. For each unique d, compute the starting-point score
4. Sort in decreasing order of starting-point score
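The four steps above can be sketched as follows. The two engine interfaces are assumptions: content(q) returning {d': R(d', q)} and reachable(d') returning {d: W(N(d'), d, d')}, with toy values standing in for real scores.

```python
# Sketch of the starting-point ranking loop from the slide.
from collections import defaultdict

def rank_starting_points(q, content, reachable):
    scores = defaultdict(float)
    for dp, rel in content(q).items():          # step 1: relevant documents d'
        for d, w in reachable(dp).items():      # step 2: d with W(N(d'), d, d') > 0
            scores[d] += rel * w                # step 3: accumulate the score
    return sorted(scores.items(), key=lambda kv: -kv[1])  # step 4: sort descending

# Toy engines: two relevant leaf pages, each reachable from a "hub" page.
ranking = rank_starting_points(
    "q",
    content=lambda q: {"leaf1": 0.8, "leaf2": 0.6},
    reachable=lambda dp: {"hub": 0.5, dp: 1.0},
)
```

In this toy run the relevant leaf "leaf1" still outranks the hub, but the hub scores higher than "leaf2" because it aggregates probability mass from both leaves, which is the intended effect of the combined score.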
Volant – Navigation Guidance
When a user is navigating, Volant intercepts the requested document and highlights links that lead to documents relevant to their query q
1. Retrieve from the content engine all documents d' that are relevant to q
2. For each d' retrieved, get from the connectivity engine the documents d that can lead to d', i.e. W(N(d'), d, d') > 0
3. For each tuple retrieved in step 2, highlight the links that point to dw
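The guidance steps can be sketched against the precomputed tuples. The tuple layout (d1, d2, dw, W) follows the preprocessing slide; the content-engine interface and the toy data are assumptions.

```python
# Sketch: decide which outgoing links of the current page to highlight.
def links_to_highlight(current, q, content, tuples):
    """Return the links d_w worth highlighting on the current page."""
    relevant = set(content(q))                  # step 1: relevant documents d'
    return {dw for (d1, d2, dw, w) in tuples    # steps 2-3: tuples from current
            if d1 == current and d2 in relevant and w > 0}

# Toy tuples: from "hub", links "a" and "b" lead toward relevant leaves,
# while link "c" only leads to an irrelevant page.
tuples = [("hub", "leaf1", "a", 0.4),
          ("hub", "leaf2", "b", 0.3),
          ("hub", "junk", "c", 0.2)]
hl = links_to_highlight("hub", "q", lambda q: ["leaf1", "leaf2"], tuples)
```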
Evaluation
Hypotheses:
1. In query-only scenarios Volant does not perform significantly worse than conventional approaches
2. In combined query/navigation scenarios Volant selects high-quality starting points
3. In a significant fraction of query/navigation scenarios the best organic starting point is of higher quality than one that can be synthesized using existing techniques
Search Task Test Sets
Navigation-prone scenarios are difficult to predict, so the Simplified Clarity Score was used to determine a set of ambiguous and unambiguous queries
Unambiguous – the 20 search tasks with highest clarity from TREC 2000
Ambiguous – 48 randomly selected tasks from TREC 2003
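The Simplified Clarity Score (He & Ounis) used here is essentially a KL divergence between the query's term distribution and the collection's; a sketch, with an illustrative toy collection:

```python
# Sketch of the Simplified Clarity Score: specific queries whose terms are
# rare in the collection score high (unambiguous); queries made of common
# terms score low (ambiguous).
from math import log2
from collections import Counter

def scs(query_terms, collection_terms):
    """SCS = sum over query terms w of P(w|Q) * log2(P(w|Q) / P_coll(w))."""
    q = Counter(query_terms)
    coll = Counter(collection_terms)
    qlen, clen = len(query_terms), len(collection_terms)
    score = 0.0
    for w, qtf in q.items():
        pq = qtf / qlen
        pc = coll.get(w, 0) / clen
        if pc > 0:
            score += pq * log2(pq / pc)
    return score

# Toy collection: "the" and "car" are common, "jaguar" and "xk8" are rare.
coll = ["the"] * 50 + ["car"] * 10 + ["jaguar"] * 2 + ["xk8"]
clear = scs(["jaguar", "xk8"], coll)   # specific query
vague = scs(["the", "car"], coll)      # common-term query
```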
Performance on Unambiguous Queries
Mean Average Precision
No significant difference. Why? Relevant documents tended not to be siblings or close cousins, so Volant deemed that the best starting points were the documents themselves.
Performance on Ambiguous Queries
User study – 48 judges judged the suitability of documents as starting points
30 starting points generated per task:
10 from the TREC 2003 winner (CSIRO)
10 from Volant with user guidance
10 from Volant without user guidance (the same documents as the 10 with guidance)
Performance on Ambiguous Queries
Rating criteria:
Breadth – spectrum of people, different interests
Accessibility – how easy to navigate and find information
Appeal – presentation of the material
Usefulness – would people be able to complete their task from this point
Each judge spent 5 hours on their task
Results
Summary & Future Work
Effectiveness – responds to users, positions them at a suitable starting point for their task, and guides them to further information in a query-driven fashion
Relationship to conventional IR – generalizes the conventional probabilistic IR model and succeeds in scenarios where conventional IR techniques fail, e.g. ambiguous queries
Discussion
Cold Start Problem
Scalability
Bias in Evaluation