Sandeep Pandey (1), Sourashis Roy (2), Christopher Olston (1), Junghoo Cho (2), Soumen Chakrabarti (3)

(1) Carnegie Mellon, (2) UCLA, (3) IIT Bombay

Shuffling a Stacked Deck

The Case for Partially Randomized Ranking of Search Engine Results

@Carnegie Mellon Databases

Popularity as a Surrogate for Quality

Search engines want to measure the “quality” of pages

Quality hard to define and measure

Various “popularity” measures are used in ranking
– e.g., in-links, PageRank, user traffic

Relationship Between Popularity and Quality

Popularity: depends on the number of users who “like” a page
– relies on both awareness and quality of the page

Popularity is correlated with quality
– when awareness is large

Problem

Popularity/quality correlation is weak for young pages
– Even if of high quality, they may not (yet) be popular due to lack of user awareness

Plus, the process of gaining popularity is inhibited by the “entrenchment effect”

Entrenchment Effect

Search engines show entrenched (already-popular) pages at the top

Users discover pages via search engines; tend to focus on top results

[Figure: ranked result list, annotated with “entrenched pages” and “user attention”]

Outline

Problem introduction

Evidence of entrenchment effect

Key idea: mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation

Summary

Evidence of Entrenchment

More news, less diversity - New York Times

Do search engines suppress controversy? - Susan L. Gerhart

Googlearchy

Distinction of retrievability and visibility

Bias on the Web - Comm. of the ACM

Are search engines biased? - Chris Sherman

The politics of search engines - IEEE Computer

The political economy of linking on the Web - ACM Conf. on Hypertext & Hypermedia

Quantification of Entrenchment Effect

Impact of Search Engines on Page Popularity
– Real Web study by Cho et al. [WWW’04]
– Pages downloaded every week from 154 sites
– Partitioned into 10 groups based on initial link popularity
– After 7 months:
  • 70% of new links went to the top 20% of pages
  • PageRank decreased for the bottom 50% of pages

Alternative Approaches to Counteract Entrenchment Effect

Weight links to young pages more
– [Baeza-Yates et al., SPIRE ’02]
– Proposed an age-based variant of PageRank

Extrapolate quality based on increase in popularity
– [Cho et al., SIGMOD ’05]
– Proposed an estimate of quality based on the derivative of popularity

Our Approach: Randomized Rank Promotion

Select random (young) pages to promote to good rank positions

Rank position to promote to is chosen at random

[Figure: a ranked list (1, 2, 3, ..., 500, 501) before and after promotion; in the example, page 500 appears in second position after promotion]

Our Approach: Randomized Rank Promotion

Consequence: Users visit promoted pages; improves quality estimate

Compared with previous approaches:
• Does not rely on temporal measurements (+)
• Sub-optimal (-)

Exploration/Exploitation Tradeoff

Exploration/exploitation tradeoff
– exploit known high-quality pages by assigning good rank positions
– explore quality of new pages by promoting them in rank

Existing search engines only exploit (to our knowledge)

Possible Objectives for Rank Promotion

Fairness
– Give each page an equal chance to become popular
– Incentive for search engines to be fair?

Quality (our choice)
– Maximize quality of search results seen by users (in aggregate)
– Quality of page p: extent to which users “like” p
– Q(p) ∈ [0,1]

Model of the Web

Web = collection of multiple disjoint topic-specific communities (e.g., “Linux”, “Squash”, etc.)

A community is made up of a set of pages, interested users, and related queries

Model of the Web

Users visit pages only by issuing queries to search engine

– Mixed surfing & searching considered in the paper

Query answer = ordered list containing all pages in the corresponding community

A single ranked list is associated with each community
– Since queries within a community are very similar

Model of the Web

Consequence: each community evolves independently of the other communities

[Figure: two separate ranked lists, one for the community on Squash and one for the community on Linux]

Quality-Per-Click Metric (QPC)

V(p,t) : number of visits to page p at time t

QPC : average quality of pages viewed by users, amortized over time

$$\mathrm{QPC} = \lim_{t \to \infty} \frac{\sum_{j=0}^{t} \sum_{p \in P} V(p,j)\, Q(p)}{\sum_{j=0}^{t} \sum_{p \in P} V(p,j)}$$
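A minimal sketch of the QPC metric over a finite horizon, assuming visits are recorded per discrete time step. The function name `quality_per_click`, the page names, and the numbers are hypothetical and only illustrate the visit-weighted average; the slide's definition takes the limit as time goes to infinity.

```python
from typing import Dict, List


def quality_per_click(visits: List[Dict[str, float]],
                      quality: Dict[str, float]) -> float:
    """Finite-horizon QPC: visit-weighted average quality over all time steps.

    visits[j][p] plays the role of V(p, j); quality[p] plays the role of Q(p).
    """
    weighted = 0.0
    total = 0.0
    for step in visits:                 # one dict of page -> visit count per step
        for page, v in step.items():
            weighted += v * quality[page]
            total += v
    if total == 0:
        raise ValueError("no visits recorded")
    return weighted / total


# Hypothetical two-page community observed for two time steps.
quality = {"old": 0.4, "new": 0.9}
visits = [{"old": 9, "new": 1}, {"old": 5, "new": 5}]
qpc = quality_per_click(visits, quality)
```

In this toy data, shifting visits toward the higher-quality page in the second step is what raises the average.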

Outline

Problem introduction

Evidence of entrenchment effect

Key idea: mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation

Summary

Desiderata for Randomized Rank Promotion

Want ability to:
– Control exploration/exploitation tradeoff
– “Select” certain pages as candidates for promotion
– “Protect” certain pages from demotion

[Figure: a ranked list (1, 2, 3, ..., 500, 501) before and after promotion]

Randomized Rank Promotion Scheme

[Figure: the page set W is split into the promotion pool Wm and the remainder W-Wm. The promotion pool is placed in random order to form list Lm; the remainder is ordered by popularity to form list Ld.]

Randomized Rank Promotion Scheme

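[Figure: merging Ld (remainder) and Lm (promotion list) into the final result list, shown for k = 3 and r = 0.5. The first k-1 positions come from Ld; the branches for later positions are labeled r and 1-r.]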

Parameters

Promotion pool (Wm)
– Uniform rank promotion: give an equal chance to each page
– Selective rank promotion: exclusively target zero-awareness pages

Start rank (k)
– rank to start randomization from

Degree of randomization (r)
– controls the tradeoff between exploration and exploitation
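The three parameters can be illustrated with a short sketch. It assumes one plausible reading of the scheme: the first k-1 positions are taken from the popularity-ordered list Ld, and each later position is filled from the randomly ordered promotion list Lm with probability r, otherwise from Ld. The function name `promote` and the example pages are hypothetical.

```python
import random
from typing import Dict, List, Optional, Set


def promote(popularity: Dict[str, float], pool: Set[str], k: int, r: float,
            rng: Optional[random.Random] = None) -> List[str]:
    """Merge Ld and Lm into one result list (sketch, assumed merge rule)."""
    rng = rng or random.Random()

    # Lm: promotion pool Wm in random order (sorted first so a seed is repeatable).
    lm = sorted(pool)
    rng.shuffle(lm)

    # Ld: all remaining pages, ordered by decreasing popularity.
    ld = sorted((p for p in popularity if p not in pool),
                key=lambda p: -popularity[p])

    # Ranks 1 .. k-1 are protected: they always come from Ld.
    result = ld[:k - 1]
    ld = ld[k - 1:]

    i = j = 0
    while i < len(ld) or j < len(lm):
        # Take from Lm with probability r, or whenever Ld is exhausted.
        take_promoted = j < len(lm) and (i >= len(ld) or rng.random() < r)
        if take_promoted:
            result.append(lm[j])
            j += 1
        else:
            result.append(ld[i])
            i += 1
    return result


# Hypothetical community: x and y are zero-awareness pages (selective promotion).
pop = {"a": 0.9, "b": 0.5, "c": 0.3, "x": 0.0, "y": 0.0}
ranking = promote(pop, pool={"x", "y"}, k=2, r=0.1, rng=random.Random(0))
```

With r = 0 the sketch falls back to plain popularity ordering, with the pool pages at the end, which matches the statement that deterministic ranking is the special case r = 0.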

Tuning the Parameters

Objective: maximize quality-per-click (QPC)

Entrenchment in a community depends on many factors
– Number of pages and users
– Page lifetimes
– Visits per user

Two ways to tune
– set parameters per community
– one parameter setting for all communities

Outline

Problem introduction

Evidence of entrenchment effect

Key idea: mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation

Summary

Popularity Evolution Cycle

Popularity P(p,t)

Rank R(p,t)

Awareness A(p,t)

Visit rate V(p,t)

Popularity to Rank Relationship

Rank of a page under the randomized rank promotion scheme
– determined by a combination of popularity and randomness

Deterministic popularity-based ranking is a special case
– i.e., r = 0

Unknown function FPR: rank as a function of the popularity of page p under a given randomized scheme

R(p,t) = FPR(P(p,t))

Viewing Likelihood

[Figure: probability of viewing (y-axis, 0 to 1.2) versus rank (x-axis, 0 to 150)]

• Depends primarily on rank in list [Joachims KDD’02]

• From AltaVista data [Lempel et al. WWW’03]:

$$F_{RV}(r) \propto r^{-1.5}$$
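As a small illustration of the power law above, the sketch below turns rank^(-1.5) weights into a share of views per rank position. Normalizing the weights so they sum to 1 over a list of n results is an assumption made here for illustration; the slide only states the proportionality. The function name `view_probabilities` is hypothetical.

```python
from typing import List


def view_probabilities(n: int, exponent: float = 1.5) -> List[float]:
    """Share of views for ranks 1..n, proportional to rank ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = view_probabilities(100)
# Rank 1 receives 4 ** 1.5 = 8 times the views of rank 4.
ratio_1_to_4 = probs[0] / probs[3]
```

The steep drop-off is what concentrates user attention on the top few results.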

Visit to Awareness Relationship

Awareness A(p,t) : fraction of users who have visited page p at least once by time t

$$A(p,t) = 1 - \left(1 - \frac{1}{m}\right)^{\int_0^t V(p,t)\,dt}$$
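A discrete-time sketch of the visit-to-awareness relationship, assuming that m is the number of users, that each visit comes from a user chosen uniformly at random, and that the integral of the visit rate is replaced by a sum of per-step visit counts. The function name `awareness` is hypothetical.

```python
from typing import Sequence


def awareness(visits_per_step: Sequence[float], m: int) -> float:
    """A(p, t) = 1 - (1 - 1/m) ** (total visits to p up to time t)."""
    total_visits = sum(visits_per_step)   # discrete stand-in for the integral of V(p, t)
    return 1.0 - (1.0 - 1.0 / m) ** total_visits


# Hypothetical page visited 5 times per step in a community of 100 users.
a_after_1 = awareness([5], m=100)
a_after_2 = awareness([5, 5], m=100)
```

Awareness grows with every visit but with diminishing returns, since later visits are increasingly likely to come from users who have already seen the page.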

Awareness to Popularity Relationship

Quality Q(p) : extent to which users like page p (contribute towards its popularity)

Popularity P(p,t):

$$P(p,t) = A(p,t) \cdot Q(p)$$

Popularity Evolution Cycle

Popularity P(p,t) → Rank R(p,t): FPR(P(p,t))

Rank R(p,t) → Visit rate V(p,t): FRV(R(p,t))

Visit rate V(p,t) → Awareness A(p,t): FVA(V(p,t))

Awareness A(p,t) → Popularity P(p,t): FAP(A(p,t))
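The cycle can be sketched as a small discrete-time simulation. This is only an illustration under several assumptions not stated on the slide: ranking is deterministic (r = 0), ties are broken by page index, visits are shared across ranks in proportion to rank^(-1.5), expected (fractional) visits are used, and no pages are created or retired. The function name `simulate` and all numbers are hypothetical.

```python
from typing import List


def simulate(qualities: List[float], m: int,
             visits_per_step: float, steps: int) -> List[float]:
    """Iterate popularity -> rank -> visit rate -> awareness -> popularity."""
    n = len(qualities)
    cum_visits = [0.0] * n
    popularity = [0.0] * n
    weights = [rank ** -1.5 for rank in range(1, n + 1)]
    norm = sum(weights)

    for _ in range(steps):
        # F_PR: rank pages by current popularity (ties broken by index).
        order = sorted(range(n), key=lambda p: (-popularity[p], p))
        # F_RV: split this step's visits across rank positions.
        for rank0, p in enumerate(order):
            cum_visits[p] += visits_per_step * weights[rank0] / norm
        # F_VA and F_AP: visits -> awareness -> popularity.
        for p in range(n):
            a = 1.0 - (1.0 - 1.0 / m) ** cum_visits[p]
            popularity[p] = a * qualities[p]
    return popularity


# Hypothetical community: page 0 has the lowest quality but ranks first initially.
early = simulate([0.5, 0.9, 0.7], m=100, visits_per_step=10.0, steps=5)
late = simulate([0.5, 0.9, 0.7], m=100, visits_per_step=10.0, steps=50)
```

Because the page that happens to rank first receives the largest share of visits, its awareness grows fastest, which is the feedback loop behind entrenchment.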

Deriving Popularity Evolution Curve

[Figure: popularity P(p,t) (y-axis) versus time t (x-axis)]

Next step: derive formula for popularity evolution curve

Derive it using the awareness distribution of pages

Deriving Popularity Evolution Curve

Assumptions
– Number of pages is constant
– Pages are created and retired according to a Poisson process with rate parameter λ
– Quality distribution of pages is stationary

In the steady state, both the popularity and awareness distributions of the pages are stationary

Popularity Evolution Curve and Awareness Distribution

Popularity Evolution Curve E(x,q): time duration for which a page of quality q has popularity value x

Awareness distribution $f(a_i \mid q)$: fraction of pages of quality q whose awareness is i / (#users)

Next: derive popularity evolution curve using the awareness distribution

Popularity Evolution Curve and Awareness Distribution

$f(a_i \mid q)$: interpret it as the probability that a page of quality q has awareness $a_i$ at any point in time

We know that:

$$P(p,t) = A(p,t) \cdot Q(p)$$

Hence,

$$E(a_i \cdot q,\; q) \propto f(a_i \mid q)$$

Deriving Awareness Distribution

$f(a_i \mid q)$: fraction of pages of quality q whose awareness is i / (#users)

Doing the steady state analysis, we get

$$f(a_i \mid q) = \frac{\lambda}{\left(\lambda + F_{RV}(F_{PR}(0))\right) \cdot (1 - a_i)} \prod_{j=1}^{i} \frac{F_{RV}(F_{PR}(a_{j-1} \cdot q))}{\lambda + F_{RV}(F_{PR}(a_j \cdot q))}$$

but remember that we do not know FPR yet

R(p,t) = FPR(P(p,t))

Deriving Awareness Distribution

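$$f(a_i \mid q) = \frac{\lambda}{\left(\lambda + F_{RV}(F_{PR}(0))\right) \cdot (1 - a_i)} \prod_{j=1}^{i} \frac{F_{RV}(F_{PR}(a_{j-1} \cdot q))}{\lambda + F_{RV}(F_{PR}(a_j \cdot q))}$$

Start with an initial form of FPR; iterate till convergence

Good news: rank is a combination of popularity and randomness, so we can derive FPR given $f(a_i \mid q)$ (ex. below)

$$F_{PR}(x) = 1 + \sum_{p \in P} \; \sum_{i = 1 + \lfloor m \cdot x / Q(p) \rfloor}^{m} f\left(\tfrac{i}{m} \mid Q(p)\right)$$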
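A sketch of the example FPR on this slide, read here as "one plus the expected number of pages whose popularity exceeds x", which appears to correspond to ranking without randomization. The awareness distribution is taken as given input. The floor in the lower summation index, the function name `rank_from_popularity`, the dictionary layout of `f`, and the toy numbers are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def rank_from_popularity(x: float, qualities: Sequence[float],
                         f: Dict[float, List[float]], m: int) -> float:
    """F_PR(x) = 1 + sum over pages p of sum_{i=1+floor(m*x/Q(p))}^{m} f(i/m | Q(p)).

    f[q][i] holds f(i/m | q) for i = 0..m; every quality q must be positive.
    """
    rank = 1.0
    for q in qualities:
        # Smallest awareness index i with (i/m) * q strictly greater than x.
        start = 1 + math.floor(m * x / q)
        rank += sum(f[q][i] for i in range(start, m + 1))
    return rank


# Hypothetical inputs: m = 2 awareness steps, two pages with qualities 1.0 and 0.5.
m = 2
qualities = [1.0, 0.5]
f = {1.0: [0.5, 0.3, 0.2], 0.5: [0.2, 0.3, 0.5]}
rank_at_zero = rank_from_popularity(0.0, qualities, f, m)
rank_at_half = rank_from_popularity(0.5, qualities, f, m)
```

In an iterative scheme like the one the slide describes, a function of this kind would be re-evaluated each round with the latest estimate of the awareness distribution.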

Summary of Where We Stand

Formalized the popularity evolution cycle
– Relationship between popularity evolution and awareness distribution
– Derived the awareness distribution

Next step: tune parameters

Recall, goal is to obtain scheme that:
1. achieves high QPC (quality per click)
2. is robust across a wide range of community types

Tuning the Promotion Scheme

Parameters: k, r and Wm

Objective: maximize QPC

Influential factors:
– Number of pages and users
– Page lifetimes
– Visits per user

Default Community Setting

Number of pages = 10,000 *

Number of users = 1000

Visits per user = 1000 visits per day

Page lifetimes = 1.5 years [Ntoulas et al., WWW’04]

* How Much Information? SIMS, Berkeley, 2003

Tuning: Wm parameter

[Figure legend: no promotion, uniform promotion, selective promotion; k = 1 and r = 0.2]

Tuning: k and r

• Optimal r lies in (0,1)

• Optimal r increases with increasing k

Based on simulation (reason: analysis only accurate for small values of r)

Tuning: k and r

Deciding k & r :

– k >= 2 for “feeling lucky”

– Minimize amount of “junk” perceived

– Maximize QPC


Final Parameter Settings

Promotion pool (Wm): zero-awareness pages

Start rank (k): 1 or 2

Randomization (r): 0.1

Tuning the Promotion Scheme

Parameters: k, r and Wm

Objective: maximize QPC

Influential factors:
– Number of pages and users
– Page lifetimes
– Visits per user

Influence of Number of Pages and Users


Influence of Page Lifetime and Visit rate


Influence of Visit Rate

[Figure annotation: 1000 visits/day per user]

Summary

Entrenchment effect hurts search result quality

Solution: Randomized rank promotion

Model of Web evolution and QPC metric
– Used to tune & evaluate randomized rank promotion

Initial results
– Significantly increases QPC
– Robust across wide range of Web communities

More study required

THE END

Paper available at: www.cs.cmu.edu/~spandey