Sandeep Pandey (1), Sourashis Roy (2), Christopher Olston (1), Junghoo Cho (2), Soumen Chakrabarti (3)

(1) Carnegie Mellon, (2) UCLA, (3) IIT Bombay

Shuffling a Stacked Deck

The Case for Partially Randomized Ranking of Search Engine Results

Carnegie Mellon Databases

Popularity as a Surrogate for Quality

Search engines want to measure the “quality” of pages

Quality is hard to define and measure

Various “popularity” measures are used in ranking
– e.g., in-links, PageRank, user traffic

[Figure: a ranked result list, positions 1 to 3]


Relationship Between Popularity and Quality

Popularity: depends on the number of users who “like” a page
– relies on both awareness and quality of the page

Popularity is correlated with quality
– when awareness is large


Problem

Popularity/quality correlation is weak for young pages
– even if of high quality, a page may not (yet) be popular due to lack of user awareness

Plus, the process of gaining popularity is inhibited by the “entrenchment effect”


Entrenchment Effect

Search engines show entrenched (already-popular) pages at the top

Users discover pages via search engines; tend to focus on top results

[Figure: a ranked result list (positions 1 to 6, ...); the entrenched pages sit at the top, where user attention is focused]


Outline

Problem introduction

Evidence of entrenchment effect

Key idea: mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation

Summary


Evidence of Entrenchment

More news, less diversity

- New York Times

Do search engines suppress controversy?

- Susan L. Gerhart

Googlearchy

Distinction of retrievability and visibility

Bias on the Web

- Comm. of the ACM

Are search engines biased?

- Chris Sherman

The politics of search engines

- IEEE Computer

The political economy of linking on the Web

- ACM Conf. on Hypertext & Hypermedia


Quantification of Entrenchment Effect

Impact of Search Engines on Page Popularity
– Real Web study by Cho et al. [WWW’04]
– Pages downloaded every week from 154 sites
– Partitioned into 10 groups based on initial link popularity
– After 7 months:
  • 70% of new links went to the top 20% of pages
  • PageRank decreased for the bottom 50% of pages


Alternative Approaches to Counteract Entrenchment Effect

Weight links to young pages more
– [Baeza-Yates et al., SPIRE ’02]
– Proposed an age-based variant of PageRank

Extrapolate quality based on increase in popularity
– [Cho et al., SIGMOD ’05]
– Proposed an estimate of quality based on the derivative of popularity


Our Approach: Randomized Rank Promotion

Select random (young) pages to promote to good rank positions

Rank position to promote to is chosen at random

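[Figure: an example ranked list (1, 2, 3, ..., 500, 501, ...) before and after randomized rank promotion; a page from a deep position (e.g., 500) appears near the top of the promoted list]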


Our Approach: Randomized Rank Promotion

Consequence: Users visit promoted pages; improves quality estimate

Compared with previous approaches:
• Does not rely on temporal measurements (+)
• Sub-optimal (-)


Exploration/Exploitation Tradeoff

Exploration/exploitation tradeoff
– exploit known high-quality pages by assigning them good rank positions
– explore the quality of new pages by promoting them in rank

Existing search engines only exploit (to our knowledge)


Possible Objectives for Rank Promotion

Fairness
– Give each page an equal chance to become popular
– Incentive for search engines to be fair?

Quality (our choice)
– Maximize quality of search results seen by users (in aggregate)
– Quality of page p: extent to which users “like” p
– Q(p) ∈ [0,1]


Model of the Web

[Figure: two disjoint communities, “Squash” and “Linux”]

Web = collection of multiple disjoint topic-specific communities (e.g., “Linux”, “Squash”, etc.)

A community is made up of a set of pages, interested users and related queries


Model of the Web

Users visit pages only by issuing queries to search engine

– Mixed surfing and searching is considered in the paper

Query answer = ordered list containing all pages in the corresponding community

A single ranked list is associated with each community
– since queries within a community are very similar


Model of the Web

Consequence: each community evolves independently of the other communities

[Figure: two separate ranked lists, one for the community on Squash and one for the community on Linux]


Quality-Per-Click Metric (QPC)

V(p,t) : number of visits to page p at time t

QPC : average quality of pages viewed by users, amortized over time

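$$QPC = \lim_{t \to \infty} \frac{\sum_{j=0}^{t} \sum_{p \in P} V(p,j) \, Q(p)}{\sum_{j=0}^{t} \sum_{p \in P} V(p,j)}$$

where P is the set of pages and j ranges over time steps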
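A minimal sketch of the QPC metric over a finite time window, which only approximates the limit in the definition. The function name `quality_per_click` and the log layout (one dict of visit counts per time step) are assumptions made for illustration, not part of the original slides.

```python
def quality_per_click(visits, quality):
    """Approximate QPC over a finite window.

    visits:  list with one dict per time step, mapping page -> V(p, t)
             (hypothetical layout chosen for this sketch).
    quality: dict mapping page -> Q(p), each value in [0, 1].
    """
    # Numerator: quality-weighted visits, summed over time steps and pages.
    weighted = sum(v * quality[p] for step in visits for p, v in step.items())
    # Denominator: total number of visits over the same window.
    total = sum(v for step in visits for v in step.values())
    if total == 0:
        raise ValueError("no visits recorded; QPC is undefined")
    return weighted / total


# Two time steps, two pages: page "a" is high quality, page "b" is mediocre.
log = [{"a": 3, "b": 1}, {"a": 1, "b": 3}]
qpc = quality_per_click(log, {"a": 1.0, "b": 0.5})
```

A ranking scheme that steers more visits toward page "a" raises this value, which is how the metric rewards surfacing high-quality pages.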




Desiderata for Randomized Rank Promotion

Want ability to:
– Control the exploration/exploitation tradeoff
– “Select” certain pages as candidates for promotion
– “Protect” certain pages from demotion

[Figure: the example ranked list before and after randomized rank promotion]


Randomized Rank Promotion Scheme

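[Figure: the page set W is split into the promotion pool Wm and the remainder W − Wm. The promotion pool is put in random order to form the list Lm; the remainder is ordered by popularity to form the list Ld.]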


Randomized Rank Promotion Scheme

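[Figure: merging the promotion list Lm and the remainder list Ld, with k = 3 and r = 0.5. The first k − 1 positions come from Ld; each later position is filled from Lm with probability r and from Ld with probability 1 − r.]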


Parameters

Promotion pool (Wm)
– Uniform rank promotion: give an equal chance to each page
– Selective rank promotion: exclusively target zero-awareness pages

Start rank (k)
– rank to start randomization from

Degree of randomization (r)
– controls the tradeoff between exploration and exploitation
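A minimal sketch of how the three parameters could combine into one ranked list. It assumes the merge works as the figure labels suggest: positions 1 to k − 1 come from the popularity-ordered list Ld, and every later position is drawn from the shuffled pool Lm with probability r, otherwise from Ld. The function name `promote` and its arguments are hypothetical.

```python
import random


def promote(pages_by_popularity, promotion_pool, k, r, rng=None):
    """Build a ranked list under randomized rank promotion (sketch).

    pages_by_popularity: all pages, most popular first.
    promotion_pool:      pages eligible for promotion (Wm).
    k:                   start rank (1-based); ranks 1..k-1 are protected.
    r:                   degree of randomization, in [0, 1].
    """
    rng = rng or random.Random()
    pool = set(promotion_pool)
    # Ld: pages outside the pool, kept in popularity order.
    ld = [p for p in pages_by_popularity if p not in pool]
    # Lm: pool pages in uniformly random order.
    lm = [p for p in pages_by_popularity if p in pool]
    rng.shuffle(lm)

    # The top k-1 positions are taken from Ld without randomization.
    result = ld[:k - 1]
    ld = ld[k - 1:]
    i = j = 0
    while i < len(ld) or j < len(lm):
        # Draw from Lm with probability r, or when Ld is exhausted.
        take_lm = j < len(lm) and (i >= len(ld) or rng.random() < r)
        if take_lm:
            result.append(lm[j])
            j += 1
        else:
            result.append(ld[i])
            i += 1
    return result


pages = list(range(1, 11))  # page 1 is the most popular
ranked = promote(pages, {8, 9, 10}, k=3, r=0.5, rng=random.Random(0))
```

With r = 0 the pool pages simply follow the popularity-ordered remainder; raising r moves pool pages closer to rank k on average.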


Tuning the Parameters

Objective: maximize quality-per-click (QPC)

Entrenchment in a community depends on many factors
– Number of pages and users
– Page lifetimes
– Visits per user

Two ways to tune
– set parameters per community
– one parameter setting for all communities




Popularity Evolution Cycle

[Cycle diagram: Popularity P(p,t) → Rank R(p,t) → Visit rate V(p,t) → Awareness A(p,t) → Popularity P(p,t)]


Popularity to Rank Relationship

Rank of a page under the randomized rank promotion scheme
– determined by a combination of popularity and randomness

Deterministic popularity-based ranking is a special case
– i.e., r = 0

Unknown function FPR: rank as a function of the popularity of page p under a given randomized scheme

R(p,t) = FPR(P(p,t))


Viewing Likelihood

[Figure: probability of viewing (y-axis, 0 to 1.2) versus rank (x-axis, 0 to 150)]

• Depends primarily on rank in list [Joachims KDD’02]

• From AltaVista data [Lempel et al. WWW’03]:

$$F_{RV}(r) \propto r^{-1.5}$$

where r here denotes the rank position (not the degree of randomization)
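A small sketch of the viewing-likelihood relationship, assuming it is the power law FRV(rank) ∝ rank^(−1.5). Normalizing the weights so they sum to 1 over a list of n results is an extra assumption made here so the values can be read as shares of user attention; the name `view_probabilities` is hypothetical.

```python
def view_probabilities(n, exponent=1.5):
    """Share of views per rank position 1..n under a power law (sketch)."""
    # Unnormalized weight for each rank: rank ** -exponent.
    weights = [rank ** -exponent for rank in range(1, n + 1)]
    total = sum(weights)
    # Normalize so the shares over the whole list sum to 1 (assumption).
    return [w / total for w in weights]


shares = view_probabilities(150)
top_vs_fourth = shares[0] / shares[3]  # ratio of rank 1 to rank 4
```

Under this exponent, rank 1 draws eight times the attention of rank 4 (4^1.5 = 8), which illustrates why pages deep in the list are rarely discovered.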


Visit to Awareness Relationship

Awareness A(p,t): fraction of users who have visited page p at least once by time t

$$A(p,t) = 1 - \left(1 - \frac{1}{m}\right)^{\int_0^t V(p,\tau)\,d\tau}$$

where m is the number of users
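A minimal numeric sketch of the visit-to-awareness relationship, assuming it reads A(p,t) = 1 − (1 − 1/m)^(∫ V(p,τ) dτ) with m the number of users. The integral is approximated by a sum over fixed-length time steps; the function name `awareness` and the step length `dt` are illustrative choices, not from the slides.

```python
def awareness(visit_rates, m, dt=1.0):
    """Fraction of m users who have seen the page at least once (sketch).

    visit_rates: visit rate V(p, t) sampled at consecutive time steps.
    m:           number of users (assumed meaning of m).
    dt:          length of one time step, used to approximate the integral.
    """
    # Approximate the integral of V(p, t) by a Riemann sum.
    total_visits = sum(v * dt for v in visit_rates)
    # Each visit is assumed to come from a uniformly random user, so a given
    # user misses all visits with probability (1 - 1/m) ** total_visits.
    return 1.0 - (1.0 - 1.0 / m) ** total_visits


# 100 visits per step for 10 steps in a community of 1000 users.
a = awareness([100] * 10, m=1000)
```

Even when the number of visits equals the number of users, awareness stays well below 1, because repeat visits by the same user do not add awareness.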


Awareness to Popularity Relationship

Quality Q(p): extent to which users like page p (contribute towards its popularity)

Popularity P(p,t):

$$P(p,t) = A(p,t) \cdot Q(p)$$


Popularity Evolution Cycle

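Popularity P(p,t) → Rank R(p,t): R(p,t) = FPR(P(p,t))

Rank R(p,t) → Visit rate V(p,t): V(p,t) = FRV(R(p,t))

Visit rate V(p,t) → Awareness A(p,t): A(p,t) = FVA(V(p,t))

Awareness A(p,t) → Popularity P(p,t): P(p,t) = FAP(A(p,t))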
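A toy simulation of the cycle under deterministic popularity ranking (r = 0), offered only as a sketch. It assumes P = A · Q, a rank^(−1.5) viewing law normalized over the list, and the awareness update implied by A = 1 − (1 − 1/m)^(visits). The community size, visit volume, and the name `simulate_cycle` are made up for illustration.

```python
def simulate_cycle(qualities, awareness, m, visits_per_step, steps):
    """Iterate popularity -> rank -> visits -> awareness (toy sketch)."""
    n = len(qualities)
    awareness = list(awareness)
    # Viewing law: share of visits received by each rank position.
    weights = [rank ** -1.5 for rank in range(1, n + 1)]
    total_weight = sum(weights)
    for _ in range(steps):
        # Awareness to popularity: P = A * Q.
        popularity = [a * q for a, q in zip(awareness, qualities)]
        # Popularity to rank: deterministic ordering, most popular first.
        order = sorted(range(n), key=lambda p: popularity[p], reverse=True)
        for position, page in enumerate(order):
            # Rank to visit rate.
            visits = visits_per_step * weights[position] / total_weight
            # Visits to awareness: unaware users shrink by (1 - 1/m) per visit.
            awareness[page] = 1.0 - (1.0 - awareness[page]) * (1.0 - 1.0 / m) ** visits
    return awareness


# Three established mediocre pages and one new high-quality page (index 3).
start = [0.5, 0.5, 0.5, 0.0]
final = simulate_cycle([0.4, 0.4, 0.4, 0.9], start, m=100,
                       visits_per_step=10, steps=5)
```

In this toy run the new page has the highest quality but stays at the bottom of the list, since its popularity starts at zero and it receives the smallest share of visits.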


Deriving Popularity Evolution Curve

[Figure: popularity P(p,t) plotted against time t]

Next step: derive a formula for the popularity evolution curve

Derive it using the awareness distribution of pages


Deriving Popularity Evolution Curve

Assumptions
– Number of pages is constant
– Pages are created and retired according to a Poisson process with rate parameter λ
– Quality distribution of pages is stationary

In the steady state, both the popularity and awareness distributions of the pages are stationary


Popularity Evolution Curve and Awareness Distribution

Popularity evolution curve E(x,q): time duration for which a page of quality q has popularity value x

Awareness distribution f(a_i | q): fraction of pages of quality q whose awareness is a_i = i / (#users)

Next: derive the popularity evolution curve using the awareness distribution


Popularity Evolution Curve and Awareness Distribution

f(a_i | q): interpret it as the probability that a page of quality q has awareness a_i at any point in time

We know that:

$$P(p,t) = A(p,t) \cdot Q(p)$$

Hence,

$$E(a_i \cdot q, q) \propto f(a_i \mid q)$$


Deriving Awareness Distribution

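f(a_i | q): fraction of pages of quality q whose awareness is a_i = i / (#users)

Doing the steady-state analysis, we get

$$f(a_i \mid q) = \frac{\lambda}{\left(\lambda + F_{RV}(F_{PR}(0))\right)(1 - a_i)} \prod_{j=1}^{i} \frac{F_{RV}(F_{PR}(a_{j-1} \cdot q))}{\lambda + F_{RV}(F_{PR}(a_j \cdot q))}$$

where λ is the rate parameter of the Poisson page creation and retirement process

But remember that we do not know FPR yet:

R(p,t) = FPR(P(p,t))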


Deriving Awareness Distribution

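$$f(a_i \mid q) = \frac{\lambda}{\left(\lambda + F_{RV}(F_{PR}(0))\right)(1 - a_i)} \prod_{j=1}^{i} \frac{F_{RV}(F_{PR}(a_{j-1} \cdot q))}{\lambda + F_{RV}(F_{PR}(a_j \cdot q))}$$

Good news: since rank is a combination of popularity and randomness, we can derive FPR given f(a_i | q). Example:

$$F_{PR}(x) = 1 + \sum_{p \in P} \sum_{i = 1 + \lfloor m \cdot x / Q(p) \rfloor}^{m} f(a_i \mid Q(p))$$

Start with an initial form of FPR; iterate until convergence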
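A sketch of the example popularity-to-rank function, assuming it reads FPR(x) = 1 + Σ over pages p of Σ for i from 1 + ⌊m·x/Q(p)⌋ to m of f(a_i | Q(p)), with a_i = i/m. In words: the expected rank is one plus the expected number of pages whose popularity exceeds x. The floor in the lower limit, the function name, and the callable `f` are assumptions of this sketch.

```python
import math


def rank_from_popularity(x, qualities, f, m):
    """Expected rank of a page with popularity x (sketch).

    qualities: Q(p) for every page in the community, each > 0.
    f:         callable f(a, q), the awareness distribution, giving the
               probability that a page of quality q has awareness a.
    m:         number of awareness levels above zero (a_i = i / m).
    """
    rank = 1.0
    for q in qualities:
        # A page of quality q beats popularity x when a_i * q > x,
        # i.e. for awareness levels i > m * x / q.
        first_level = 1 + math.floor(m * x / q)
        rank += sum(f(i / m, q) for i in range(first_level, m + 1))
    return rank


def uniform(a, q):
    """Placeholder distribution: all 5 awareness levels equally likely."""
    return 1.0 / 5


lowest = rank_from_popularity(0.0, [0.5, 0.5, 0.5], uniform, m=4)
highest = rank_from_popularity(0.5, [0.5, 0.5, 0.5], uniform, m=4)
```

The uniform distribution is only a stand-in; in the iterative procedure, f would be recomputed from the current FPR on each round.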


Summary of Where We Stand

Formalized the popularity evolution cycle
– Relationship between popularity evolution and awareness distribution
– Derived the awareness distribution

Next step: tune parameters

Recall, the goal is to obtain a scheme that:
1. achieves high QPC (quality per click)
2. is robust across a wide range of community types


Tuning the Promotion Scheme

Parameters: k, r and Wm

Objective: maximize QPC

Influential factors:



Default Community Setting

Number of pages = 10,000 *

Number of users = 1000

Visits per user = 1000 visits per day

Page lifetimes = 1.5 years [Ntoulas et al., WWW’04]

* How Much Information? SIMS, Berkeley, 2003


Tuning: Wm parameter

[Figure: comparison of no promotion, uniform promotion, and selective promotion, with k = 1 and r = 0.2]


Tuning: k and r

• Optimal r lies in the interval (0,1)

• Optimal r increases with increasing k

Based on simulation (reason: the analysis is only accurate for small values of r)


Tuning: k and r

Deciding k & r :

– k >= 2 for “feeling lucky”

– Minimize amount of “junk” perceived

– Maximize QPC


Final Parameter Settings

Promotion pool (Wm): zero-awareness pages

Start rank (k): 1 or 2

Randomization (r): 0.1


Tuning the Promotion Scheme

Parameters: k, r and Wm

Objective: maximize QPC Influential factors:



Influence of Number of Pages and Users


Influence of Page Lifetime and Visit rate


Influence of Visit Rate

[Figure annotation: 1000 visits/day per user]


Summary

Entrenchment effect hurts search result quality

Solution: Randomized rank promotion

Model of Web evolution and QPC metric
– Used to tune and evaluate randomized rank promotion

Initial results
– Significantly increases QPC
– Robust across a wide range of Web communities

More study required


THE END

Paper available at: www.cs.cmu.edu/~spandey
