

Page 1

Sandeep Pandey (1), Sourashis Roy (2), Christopher Olston (1), Junghoo Cho (2), Soumen Chakrabarti (3)

(1) Carnegie Mellon, (2) UCLA, (3) IIT Bombay

Shuffling a Stacked Deck

The Case for Partially Randomized Ranking of Search Engine Results

Page 2

@Carnegie Mellon Databases

Popularity as a Surrogate for Quality

Search engines want to measure the “quality” of pages

Quality hard to define and measure

Various “popularity” measures are used in ranking
– e.g., in-links, PageRank, user traffic

[Figure: ranked list of search results]

Page 3

Relationship Between Popularity and Quality

Popularity: depends on the number of users who “like” a page
– relies on both awareness and quality of the page

Popularity correlated with quality
– when awareness is large

Page 4

Problem

Popularity/quality correlation weak for young pages
– Even if of high quality, may not (yet) be popular due to lack of user awareness

Plus, process of gaining popularity inhibited by “entrenchment effect”

Page 5

Entrenchment Effect

Search engines show entrenched (already-popular) pages at the top

Users discover pages via search engines; tend to focus on top results

[Figure: ranked result list, with labels "entrenched pages" and "user attention"]

Page 6

Outline

Problem introduction

Evidence of entrenchment effect

Key idea: Mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation

Summary

Page 7

Evidence of Entrenchment

More news, less diversity - New York Times

Do search engines suppress controversy? - Susan L. Gerhart

Googlearchy

Distinction of retrievability and visibility

Bias on the Web - Comm. of the ACM

Are search engines biased? - Chris Sherman

The politics of search engines - IEEE Computer

The political economy of linking on the Web - ACM Conf. on Hypertext & Hypermedia

Page 8

Quantification of Entrenchment Effect

Impact of Search Engines on Page Popularity
– Real Web study by Cho et al. [WWW’04]
– Pages downloaded every week from 154 sites
– Partitioned into 10 groups based on initial link popularity
– After 7 months:
    • 70% of new links went to the top 20% of pages
    • PageRank decreased for the bottom 50% of pages

Page 9

Alternative Approaches to Counteract the Entrenchment Effect

Weight links to young pages more
– [Baeza-Yates et al., SPIRE ’02]
– Proposed an age-based variant of PageRank

Extrapolate quality based on increase in popularity
– [Cho et al., SIGMOD ’05]
– Proposed an estimate of quality based on the derivative of popularity

Page 10

Our Approach: Randomized Rank Promotion

Select random (young) pages to promote to good rank positions

Rank position to promote to is chosen at random

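[Figure: a popularity-ordered list (1, 2, 3, ..., 500, 501, ...) shown next to a list with randomly promoted pages (1, 500, 2, 499, 501, ..., 3)]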

Page 11

Our Approach: Randomized Rank Promotion

Consequence: Users visit promoted pages; improves quality estimate

Compared with previous approaches:
• Does not rely on temporal measurements (+)
• Sub-optimal (-)

Page 12

Exploration/Exploitation Tradeoff

Exploration/Exploitation tradeoff
– exploit known high-quality pages by assigning good rank positions
– explore quality of new pages by promoting them in rank

Existing search engines only exploit (to our knowledge)

Page 13

Possible Objectives for Rank Promotion

Fairness
– Give each page an equal chance to become popular
– Incentive for search engines to be fair?

Quality (our choice)
– Maximize quality of search results seen by users (in aggregate)
– Quality of page p: extent to which users “like” p
– Q(p) ∈ [0,1]

Page 14

Model of the Web

[Figure: two communities, labeled "Squash" and "Linux"]

Web = collection of multiple disjoint topic-specific communities (e.g., “Linux”, “Squash”, etc.)

A community is made up of a set of pages, interested users and related queries

Page 15

Model of the Web

Users visit pages only by issuing queries to search engine

– Mixed surfing & searching considered in the paper

Query answer = ordered list containing all pages in the corresponding community

A single ranked list associated with each community
– Since queries within a community are very similar

Page 16

Model of the Web

Consequence: Each community evolves independent of the other communities

[Figure: two separate ranked lists, one for the community on Squash and one for the community on Linux]

Page 17

Quality-Per-Click Metric (QPC)

V(p,t) : number of visits to page p at time t

QPC : average quality of pages viewed by users, amortized over time

$$\mathrm{QPC} = \lim_{t \to \infty} \frac{\sum_{j=0}^{t} \sum_{p \in P} V(p,j)\, Q(p)}{\sum_{j=0}^{t} \sum_{p \in P} V(p,j)}$$
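The QPC definition can be illustrated with a small finite-horizon computation. This is only a sketch: it assumes visits are logged per discrete time step and stops at a fixed horizon instead of taking the limit. The function name `quality_per_click` and the toy numbers are hypothetical, not taken from the paper.

```python
from typing import Dict, List


def quality_per_click(visits: List[Dict[str, float]],
                      quality: Dict[str, float]) -> float:
    """Finite-horizon estimate of QPC.

    visits[j][p] plays the role of V(p, j): visits to page p at time step j.
    quality[p] plays the role of Q(p), a value in [0, 1].
    """
    weighted = 0.0  # numerator: sum over j and p of V(p, j) * Q(p)
    total = 0.0     # denominator: sum over j and p of V(p, j)
    for step in visits:
        for page, count in step.items():
            weighted += count * quality[page]
            total += count
    if total == 0:
        raise ValueError("no visits recorded")
    return weighted / total


# Toy data (hypothetical): a mediocre entrenched page and a good new page.
quality = {"old": 0.4, "new": 0.9}
visits = [{"old": 9, "new": 1}, {"old": 5, "new": 5}]
print(round(quality_per_click(visits, quality), 3))
```

In this toy log, QPC rises as visits shift from the low-quality page to the high-quality one, which is the effect the metric is meant to capture.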

Page 18

Outline

Problem introduction

Evidence of entrenchment effect

Key idea: Mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation

Summary

Page 19

Desiderata for Randomized Rank Promotion

Want ability to:
– Control exploration/exploitation tradeoff
– “Select” certain pages as candidates for promotion
– “Protect” certain pages from demotion

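[Figure: a popularity-ordered list (1, 2, 3, ..., 500, 501, ...) shown next to a list with randomly promoted pages (1, 500, 2, 499, 501, ..., 3)]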

Page 20

Randomized Rank Promotion Scheme

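[Figure: the page set W is split into the promotion pool Wm and the remainder W-Wm. Pool pages are put in random order to form list Lm; remainder pages are ordered by popularity to form list Ld.]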

Page 21

Randomized Rank Promotion Scheme

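[Figure: merging the remainder list Ld with the promotion list Lm, for k = 3 and r = 0.5. The first k-1 positions are taken from Ld; the labels r and 1-r mark the two branches for each later position.]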

Page 22

Parameters

Promotion pool (Wm)
– Uniform rank promotion: give an equal chance to each page
– Selective rank promotion: exclusively target zero-awareness pages

Start rank (k)
– rank to start randomization from

Degree of randomization (r)
– controls the tradeoff between exploration and exploitation
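The three parameters can be put together in a short sketch of the promotion scheme. The merge rule used here is an assumption read off the slide labels (k-1, r, 1-r): the first k-1 positions come from the popularity-ordered list Ld, and each later position takes the next page of the randomly ordered list Lm with probability r, otherwise the next page of Ld. The function name `promote_ranking` and the sample pages are hypothetical.

```python
import random
from typing import Dict, List, Optional, Set


def promote_ranking(popularity: Dict[str, float],
                    pool: Set[str],
                    k: int,
                    r: float,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Merge a popularity-ordered list with a randomly ordered promotion pool.

    popularity: page -> popularity score
    pool:       promotion pool Wm (pages eligible for promotion)
    k:          start rank; positions 1..k-1 are never randomized
    r:          degree of randomization, 0 <= r <= 1
    """
    rng = rng or random.Random()
    # Lm: promotion pool in random order.
    lm = [p for p in popularity if p in pool]
    rng.shuffle(lm)
    # Ld: remaining pages ordered by decreasing popularity.
    ld = sorted((p for p in popularity if p not in pool),
                key=lambda p: popularity[p], reverse=True)

    # The top k-1 positions come straight from Ld.
    result = ld[:k - 1]
    ld = ld[k - 1:]
    # Each later position takes the next page of Lm with probability r,
    # otherwise the next page of Ld; when one list runs out, use the other.
    i = j = 0
    while i < len(ld) or j < len(lm):
        take_promoted = j < len(lm) and (i >= len(ld) or rng.random() < r)
        if take_promoted:
            result.append(lm[j])
            j += 1
        else:
            result.append(ld[i])
            i += 1
    return result


# Selective promotion: the pool holds only the zero-awareness pages.
pop = {"a": 0.9, "b": 0.7, "c": 0.5, "new1": 0.0, "new2": 0.0}
ranking = promote_ranking(pop, pool={"new1", "new2"}, k=2, r=0.5,
                          rng=random.Random(0))
print(ranking)
```

With r = 0 the sketch reduces to plain popularity ordering, with the pool pages appended at the end. With k = 2 the top result is always the most popular page outside the pool.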

Page 23

Tuning the Parameters

Objective: maximize quality-per-click (QPC)

Entrenchment in a community depends on many factors
– Number of pages and users
– Page lifetimes
– Visits per user

Two ways to tune
– set parameters per community
– one parameter setting for all communities

Page 24

Outline

Problem introduction

Evidence of entrenchment effect

Key idea: Mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation

Summary

Page 25

Popularity Evolution Cycle

Popularity P(p,t)

Rank R(p,t)

Awareness A(p,t)

Visit rate V(p,t)

Page 26

Popularity to Rank Relationship

Rank of a page under randomized rank promotion scheme
– determined by a combination of popularity and randomness

Deterministic popularity-based ranking is a special case
– i.e., r = 0

Unknown function FPR: rank as a function of the popularity of page p under a given randomized scheme

R(p,t) = FPR(P(p,t))

DETAIL

Page 27

Viewing Likelihood

[Figure: probability of viewing versus rank; rank axis from 0 to 150, probability axis from 0 to 1.2]

• Depends primarily on rank in list [Joachims KDD’02]

• From AltaVista data [Lempel et al. WWW’03]:

$$F_{RV}(r) \propto r^{-1.5}$$
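To get a feel for how steeply attention falls off under this power law, here is a minimal sketch that turns the proportionality into a probability distribution over rank positions. Normalizing over a finite list of n results is an assumption made for illustration (the slide gives only the proportionality), and `view_probabilities` is a hypothetical name.

```python
from typing import List


def view_probabilities(n: int, exponent: float = 1.5) -> List[float]:
    """Probability that a view lands on each rank 1..n, assuming the
    viewing likelihood is proportional to rank ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = view_probabilities(150)
print(f"rank 1: {probs[0]:.3f}, rank 10: {probs[9]:.4f}")
print(f"share of views going to the top 10: {sum(probs[:10]):.3f}")
```

Under this assumption, most views in a 150-item list go to the first ten positions, which is why a page stuck at a low rank gains awareness so slowly.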

DETAIL

Page 28

Visit to Awareness Relationship

Awareness A(p,t) : fraction of users who have visited page p at least once by time t

$$A(p,t) = 1 - \left(1 - \frac{1}{m}\right)^{\int_0^t V(p,t)\,dt}$$
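The awareness formula can be checked numerically with a short sketch. It assumes m is the number of users in the community and that each visit comes from a user chosen uniformly at random, so a given user has missed all N visits with probability (1 - 1/m)^N. The function name `awareness` and the sample values are hypothetical.

```python
def awareness(total_visits: float, num_users: int) -> float:
    """Fraction of users who have visited a page at least once.

    total_visits stands for the integral of V(p, t) up to the current time;
    num_users stands for m (assumed here to be the number of users).
    """
    return 1.0 - (1.0 - 1.0 / num_users) ** total_visits


for visits in (0, 100, 1000, 5000):
    print(visits, round(awareness(visits, num_users=1000), 3))
```

Awareness starts at zero, grows almost linearly while visits are few, and saturates toward 1 as repeat visits by already-aware users become more likely.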

DETAIL

Page 29

Awareness to Popularity Relationship

Quality Q(p) : extent to which users like page p (contribute towards its popularity)

Popularity P(p,t):

$$P(p,t) = A(p,t) \cdot Q(p)$$

DETAIL

Page 30

Popularity Evolution Cycle

[Figure: cycle linking the four quantities, each edge labeled with its function]

Popularity P(p,t) = FAP(A(p,t))

Rank R(p,t) = FPR(P(p,t))

Visit rate V(p,t) = FRV(R(p,t))

Awareness A(p,t) = FVA(V(p,t))
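The cycle can be iterated in a small discrete-time sketch for the deterministic case (r = 0). Several things here are assumptions for illustration: time advances in fixed steps, a fixed number of visits per step is split across rank positions in proportion to rank^(-1.5), awareness follows 1 - (1 - 1/m)^(cumulative visits), and popularity is awareness times quality. The function name `simulate_cycle` and the toy community are hypothetical.

```python
from typing import List


def simulate_cycle(quality: List[float], initial_visits: List[float],
                   num_users: int, visits_per_step: float,
                   steps: int) -> List[float]:
    """Iterate P -> R -> V -> A -> P under deterministic popularity ranking.

    Returns the popularity of every page after `steps` iterations.
    """
    n = len(quality)
    cumulative = list(initial_visits)  # visits received so far, per page
    # F_RV: share of visits per rank position, proportional to rank^-1.5.
    weights = [rank ** -1.5 for rank in range(1, n + 1)]
    total_weight = sum(weights)
    share = [w / total_weight for w in weights]

    def popularity() -> List[float]:
        # F_VA then F_AP: awareness from cumulative visits, times quality.
        return [(1.0 - (1.0 - 1.0 / num_users) ** cumulative[p]) * quality[p]
                for p in range(n)]

    for _ in range(steps):
        pop = popularity()
        # F_PR with r = 0: order pages by decreasing popularity.
        order = sorted(range(n), key=lambda p: pop[p], reverse=True)
        # F_RV: distribute this step's visits according to rank position.
        for position, page in enumerate(order):
            cumulative[page] += visits_per_step * share[position]
    return popularity()


# Hypothetical community: page 0 is mediocre but already well known,
# page 1 is the best page but starts with no visitors.
final = simulate_cycle(quality=[0.4, 0.9, 0.3, 0.2],
                       initial_visits=[200.0, 0.0, 50.0, 50.0],
                       num_users=100, visits_per_step=10.0, steps=30)
print([round(x, 3) for x in final])
```

In this toy run the highest-quality page starts at the bottom of the list and receives only a small share of visits per step, so after 30 steps its popularity is still below that of the entrenched page.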

Page 31

Deriving Popularity Evolution Curve

[Figure: popularity P(p,t) versus time (t)]

Next step : derive formula for popularity evolution curve

Derive it using the awareness distribution of pages

Page 32

Deriving Popularity Evolution Curve

Assumptions
– Number of pages constant
– Pages are created and retired according to a Poisson process with rate parameter λ
– Quality distribution of pages is stationary

In the steady state, both popularity and awareness distribution of the pages are stationary

Page 33

Popularity Evolution Curve and Awareness Distribution

Popularity Evolution Curve E(x,q): time duration for which a page of quality q has popularity value x

Next: derive popularity evolution curve using the awareness distribution

Awareness distribution f(a_i | q): fraction of pages of quality q whose awareness is i / (#users)

DETAIL

Page 34

Popularity Evolution Curve and Awareness Distribution

f(a_i | q): interpret it as the probability that a page of quality q has awareness a_i at any point in time

We know that:

$$P(p,t) = A(p,t) \cdot Q(p)$$

Hence,

$$E(a_i \cdot q,\, q) \propto f(a_i \mid q)$$

DETAIL

Page 35

Deriving Awareness Distribution

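f(a_i | q): fraction of pages of quality q whose awareness is i / (#users)

Doing the steady state analysis, we get

$$f(a_i \mid q) = \frac{\lambda}{\bigl(\lambda + F_{RV}(F_{PR}(0))\bigr)\,(1 - a_i)} \prod_{j=1}^{i} \frac{F_{RV}(F_{PR}(a_{j-1} \cdot q))}{\lambda + F_{RV}(F_{PR}(a_j \cdot q))}$$

(λ is the Poisson rate parameter from the assumptions)

but remember that we do not know FPR yet

R(p,t) = FPR(P(p,t))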

DETAIL

Page 36

Deriving Awareness Distribution

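$$f(a_i \mid q) = \frac{\lambda}{\bigl(\lambda + F_{RV}(F_{PR}(0))\bigr)\,(1 - a_i)} \prod_{j=1}^{i} \frac{F_{RV}(F_{PR}(a_{j-1} \cdot q))}{\lambda + F_{RV}(F_{PR}(a_j \cdot q))}$$

(λ is the Poisson rate parameter from the assumptions)

Start with an initial form of FPR; iterate till convergence

Good news: since rank is a combination of popularity and randomness, we can derive FPR given f(a_i | q) (example below)

$$F_{PR}(x) = 1 + \sum_{p \in P} \; \sum_{i = 1 + \lfloor m \cdot x / Q(p) \rfloor}^{m} f(a_i \mid Q(p))$$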

DETAIL

Page 37

Summary of Where We Stand

Formalized the popularity evolution cycle
– Relationship between popularity evolution and awareness distribution
– Derived the awareness distribution

Next step: tune parameters

Recall, goal is to obtain scheme that:
1. achieves high QPC (quality per click)
2. is robust across a wide range of community types

Page 38

Tuning the Promotion Scheme

Parameters: k, r and Wm

Objective: maximize QPC

Influential factors:
– Number of pages and users
– Page lifetimes
– Visits per user

Page 39

Default Community Setting

Number of pages = 10,000 *

Number of users = 1000

Visits per user = 1000 visits per day

Page lifetimes = 1.5 years [Ntoulas et al., WWW’04]

* How Much Information? SIMS, Berkeley, 2003

Page 40

Tuning: Wm parameter

[Figure: comparison of no promotion, uniform promotion, and selective promotion, with k = 1 and r = 0.2]

Page 41

Tuning: k and r

• Optimal r: (0,1)

• Optimal r increases with increasing k

Based on simulation (reason: analysis only accurate for small values of r)

Page 42

Tuning: k and r

Deciding k & r :

– k >= 2 for “feeling lucky”

– Minimize amount of “junk” perceived

– Maximize QPC

Page 43

Final Parameter Settings

Promotion pool (Wm): zero-awareness pages

Start rank (k): 1 or 2

Randomization (r): 0.1

Page 44

Tuning the Promotion Scheme

Parameters: k, r and Wm

Objective: maximize QPC

Influential factors:
– Number of pages and users
– Page lifetimes
– Visits per user

Page 45

Influence of Number of Pages and Users

Page 46

Influence of Page Lifetime and Visit rate

Page 47

Influence of Visit Rate

[Figure label: 1000 visits/day per user]

Page 48

Summary

Entrenchment effect hurts search result quality

Solution: Randomized rank promotion

Model of Web evolution and QPC metric
– Used to tune & evaluate randomized rank promotion

Initial results
– Significantly increases QPC
– Robust across wide range of Web communities

More study required

Page 49

THE END

Paper available at: www.cs.cmu.edu/~spandey