Sandeep Pandey 1, Sourashis Roy 2, Christopher Olston 1, Junghoo Cho 2, Soumen Chakrabarti 3
1 Carnegie Mellon, 2 UCLA, 3 IIT Bombay
Shuffling a Stacked Deck
The Case for Partially Randomized Ranking of Search Engine Results
@Carnegie MellonDatabases
Popularity as a Surrogate for Quality
Search engines want to measure the “quality” of pages
Quality hard to define and measure
Various “popularity” measures are used in ranking
– e.g., in-links, PageRank, user traffic
[Figure: ranked list of search results]
Relationship Between Popularity and Quality
Popularity: depends on the number of users who “like” a page
– relies on both awareness and quality of the page
Popularity is correlated with quality
– when awareness is large
Problem
Popularity/quality correlation is weak for young pages
– Even if of high quality, a page may not (yet) be popular due to lack of user awareness
Plus, the process of gaining popularity is inhibited by the “entrenchment effect”
Entrenchment Effect
Search engines show entrenched (already-popular) pages at the top
Users discover pages via search engines; tend to focus on top results
[Figure: ranked result list; entrenched pages hold the top positions, where user attention is focused]
Outline
Problem introduction
Evidence of entrenchment effect
Key idea: Mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation
Summary
Evidence of Entrenchment
More news, less diversity - New York Times
Do search engines suppress controversy? - Susan L. Gerhart
Googlearchy
Distinction of retrievability and visibility
Bias on the Web - Comm. of the ACM
Are search engines biased? - Chris Sherman
The politics of search engines - IEEE Computer
The political economy of linking on the Web - ACM Conf. on Hypertext & Hypermedia
Quantification of Entrenchment Effect
Impact of Search Engines on Page Popularity
– Real Web study by Cho et al. [WWW’04]
– Pages downloaded every week from 154 sites
– Partitioned into 10 groups based on initial link popularity
– After 7 months:
  • 70% of new links went to the top 20% of pages
  • PageRank decreased for the bottom 50% of pages
Alternative Approaches to Counteract the Entrenchment Effect
Weight links to young pages more
– [Baeza-Yates et al., SPIRE ’02]
– Proposed an age-based variant of PageRank
Extrapolate quality based on increase in popularity
– [Cho et al., SIGMOD ’05]
– Proposed an estimate of quality based on the derivative of popularity
Our Approach: Randomized Rank Promotion
Select random (young) pages to promote to good rank positions
Rank position to promote to is chosen at random
[Figure: result list before promotion (1, 2, 3, 500, 501, ..) and after promotion (1, 500, 2, 499, 501, .., 3)]
Our Approach: Randomized Rank Promotion
Consequence: Users visit promoted pages; improves quality estimate
Compared with previous approaches:
• Does not rely on temporal measurements (+)
• Sub-optimal (-)
Exploration/Exploitation Tradeoff
Exploration/Exploitation tradeoff
– exploit known high-quality pages by assigning good rank positions
– explore quality of new pages by promoting them in rank
Existing search engines only exploit (to our knowledge)
Possible Objectives for Rank Promotion
Fairness
– Give each page an equal chance to become popular
– Incentive for search engines to be fair?
Quality (our choice)
– Maximize quality of search results seen by users (in aggregate)
– Quality of page p: extent to which users “like” p
– Q(p) ∈ [0,1]
Model of the Web
[Figure: two example communities, Squash and Linux]
Web = collection of multiple disjoint topic-specific communities (e.g., “Linux”, “Squash”, etc.)
A community is made up of a set of pages, interested users, and related queries
Model of the Web
Users visit pages only by issuing queries to the search engine
– Mixed surfing & searching considered in the paper
Query answer = ordered list containing all pages in the corresponding community
A single ranked list is associated with each community
– Since queries within a community are very similar
Model of the Web
Consequence: Each community evolves independently of the other communities
[Figure: two separate ranked lists, one for the community on Squash and one for the community on Linux]
Quality-Per-Click Metric (QPC)
V(p,t): number of visits to page p at time t
QPC: average quality of pages viewed by users, amortized over time
$$\mathrm{QPC} = \lim_{t \to \infty} \frac{\sum_{t_j=0}^{t} \sum_{p \in P} V(p,t_j)\, Q(p)}{\sum_{t_j=0}^{t} \sum_{p \in P} V(p,t_j)}$$
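As a rough illustration of the QPC definition, the sketch below computes the visit-weighted average quality over a finite window of time steps. The function name `quality_per_click` and the toy visit counts are hypothetical; the slide defines QPC as a limit over time, so a finite window only approximates it.

```python
from typing import Dict, List


def quality_per_click(visits: List[Dict[str, int]], quality: Dict[str, float]) -> float:
    """Visit-weighted average quality over a finite window of time steps.

    visits[t][p] is the number of visits to page p at step t, i.e. V(p, t);
    quality[p] is Q(p) in [0, 1].
    """
    weighted = 0.0
    total = 0
    for step in visits:
        for page, count in step.items():
            weighted += count * quality[page]
            total += count
    if total == 0:
        raise ValueError("no visits recorded")
    return weighted / total


# Hypothetical toy data: an entrenched page and a newer, better page.
quality = {"old": 0.4, "new": 0.9}
visits = [{"old": 9, "new": 1}, {"old": 5, "new": 5}]
qpc = quality_per_click(visits, quality)
```

In this toy window, QPC rises as more visits shift to the higher-quality page, which is the effect rank promotion is meant to produce.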
Outline
Problem introduction
Evidence of entrenchment effect
Key idea: Mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation
Summary
Desiderata for Randomized Rank Promotion
Want ability to:
– Control exploration/exploitation tradeoff
– “Select” certain pages as candidates for promotion
– “Protect” certain pages from demotion
[Figure: result list before promotion (1, 2, 3, 500, 501, ..) and after promotion (1, 500, 2, 499, 501, .., 3)]
Randomized Rank Promotion Scheme
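[Figure: the set of pages W is split into the promotion pool Wm and the remainder W-Wm. Pages in the promotion pool are put in random order to form the list Lm; the remainder is ordered by popularity to form the list Ld.]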
Randomized Rank Promotion Scheme
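[Figure: merging Ld and the promotion list Lm into the final result list, with k = 3 and r = 0.5 as the example. The first k-1 positions come from Ld; each later position is taken from Lm with probability r and from Ld with probability 1-r.]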
Parameters
Promotion pool (Wm)
– Uniform rank promotion: give an equal chance to each page
– Selective rank promotion: exclusively target zero-awareness pages
Start rank (k)
– rank to start randomization from
Degree of randomization (r)
– controls the tradeoff between exploration and exploitation
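A minimal sketch of how the three parameters could fit together, assuming this reading of the scheme: the promotion pool is shuffled into Lm, the remaining pages are sorted by popularity into Ld, ranks 1 to k-1 are taken from Ld, and every later rank is drawn from Lm with probability r and from Ld otherwise. The function name, argument names, and the rule of falling back to the other list once one runs out are assumptions made for illustration, not details taken from the slides.

```python
import random
from typing import Dict, List, Optional, Set


def randomized_rank_promotion(
    popularity: Dict[str, float],
    promotion_pool: Set[str],
    k: int,
    r: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Merge a shuffled promotion list (Lm) with a popularity-ordered list (Ld)."""
    rng = rng or random.Random()
    # Lm: promotion pool in random order (sorted first so a seeded rng is reproducible).
    lm = sorted(promotion_pool)
    rng.shuffle(lm)
    # Ld: all other pages, most popular first.
    ld = sorted((p for p in popularity if p not in promotion_pool),
                key=lambda p: popularity[p], reverse=True)
    # Ranks 1 .. k-1 are protected: they come straight from Ld.
    result = ld[:k - 1]
    ld = ld[k - 1:]
    i = j = 0
    while i < len(lm) or j < len(ld):
        # Assumption: when one list is exhausted, fill from the other.
        take_promoted = j >= len(ld) or (i < len(lm) and rng.random() < r)
        if take_promoted:
            result.append(lm[i])
            i += 1
        else:
            result.append(ld[j])
            j += 1
    return result


# Hypothetical community; selective promotion targets zero-popularity pages.
popularity = {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.0, "e": 0.0}
pool = {p for p, pop in popularity.items() if pop == 0.0}
ranking = randomized_rank_promotion(popularity, pool, k=2, r=0.5, rng=random.Random(0))
```

With r = 0 the result is plain popularity ranking with the pool appended at the end; with r = 1 the whole pool is placed immediately after the first k-1 protected positions.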
Tuning the Parameters
Objective: maximize quality-per-click (QPC)
Entrenchment in a community depends on many factors
– Number of pages and users
– Page lifetimes
– Visits per user
Two ways to tune
– set parameters per community
– one parameter setting for all communities
Outline
Problem introduction
Evidence of entrenchment effect
Key idea: Mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation
Summary
Popularity Evolution Cycle
[Figure: cycle] Popularity P(p,t) → Rank R(p,t) → Visit rate V(p,t) → Awareness A(p,t) → back to Popularity P(p,t)
Popularity to Rank Relationship
Rank of a page under the randomized rank promotion scheme
– determined by a combination of popularity and randomness
Deterministic popularity-based ranking is a special case
– i.e., r=0
Unknown function FPR: rank as a function of the popularity of page p under a given randomized scheme
R(p,t) = FPR(P(p,t))
Viewing Likelihood
[Figure: probability of viewing (y-axis, 0 to 1.2) plotted against rank (x-axis, 0 to 150)]
• Depends primarily on rank in list [Joachims KDD’02]
• From AltaVista data [Lempel et al. WWW’03] (here r denotes rank, not the degree of randomization):
$$F_{RV}(r) \propto r^{-1.5}$$
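To make the power law concrete, the sketch below turns rank^-1.5 weights into a share of visits for each position in a result list. Normalizing the weights so they sum to 1 is an assumption for illustration (the slide only states proportionality), and `view_probabilities` is a hypothetical helper name.

```python
from typing import List


def view_probabilities(n: int, exponent: float = 1.5) -> List[float]:
    """Share of views for ranks 1..n, proportional to rank ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = view_probabilities(100)
# Fraction of all views that land on the first ten results of a 100-item list.
top10_share = sum(probs[:10])
```

Under this assumption the first few positions absorb most of the attention, which is why pages stuck at low ranks gain awareness so slowly.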
Visit to Awareness Relationship
Awareness A(p,t): fraction of users who have visited page p at least once by time t
$$A(p,t) = 1 - \left(1 - \frac{1}{m}\right)^{\int_0^t V(p,t)\,dt}$$
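A small sketch of the visit-to-awareness relationship, reading the formula on this slide as A = 1 - (1 - 1/m)^(total visits to p so far) and assuming m is the number of users, with each visit coming from a user picked uniformly at random. The function and argument names are hypothetical.

```python
def awareness(total_visits: float, num_users: int) -> float:
    """Fraction of users expected to have seen a page after total_visits visits."""
    # Probability that one given user made none of the visits: (1 - 1/m) ** visits.
    return 1.0 - (1.0 - 1.0 / num_users) ** total_visits


# With 1000 users, 1000 visits reach roughly 63% of them (repeat visitors add nothing).
a = awareness(1000, 1000)
```

Awareness therefore saturates: each additional visit is less likely to come from a user who has not seen the page before.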
Awareness to Popularity Relationship
Quality Q(p): extent to which users like page p (contribute towards its popularity)
Popularity P(p,t):
$$P(p,t) = A(p,t) \cdot Q(p)$$
Popularity Evolution Cycle
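[Figure: cycle, with the function that links each stage to the next]
Popularity P(p,t) → Rank R(p,t): FPR(P(p,t))
Rank R(p,t) → Visit rate V(p,t): FRV(R(p,t))
Visit rate V(p,t) → Awareness A(p,t): FVA(V(p,t))
Awareness A(p,t) → Popularity P(p,t): FAP(A(p,t))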
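The cycle can be sketched as a small deterministic simulation for the special case r = 0 (pure popularity ranking). Everything concrete here is an assumption for illustration: discrete time steps, a fixed visit budget per step split across ranks in proportion to rank^-1.5, visits coming from uniformly random users, and the toy community itself.

```python
from typing import List


def simulate_cycle(quality: List[float], awareness: List[float],
                   num_users: int, visits_per_step: float, steps: int) -> List[float]:
    """Iterate popularity -> rank -> visit rate -> awareness for `steps` rounds."""
    a = list(awareness)
    n = len(quality)
    # F_RV: share of visits at each rank position, proportional to rank^-1.5.
    weights = [(pos + 1) ** -1.5 for pos in range(n)]
    total = sum(weights)
    for _ in range(steps):
        # F_AP: popularity = awareness * quality.
        popularity = [a[p] * quality[p] for p in range(n)]
        # F_PR with r = 0: order pages by descending popularity (ties keep index order).
        order = sorted(range(n), key=lambda p: -popularity[p])
        for pos, p in enumerate(order):
            visits = visits_per_step * weights[pos] / total
            # F_VA: each visit is assumed to come from a uniformly random user.
            a[p] = 1.0 - (1.0 - a[p]) * (1.0 - 1.0 / num_users) ** visits
    return a


# Hypothetical toy community: four established pages and one new high-quality page.
quality = [0.4, 0.4, 0.4, 0.4, 0.9]
start = [0.5, 0.4, 0.3, 0.2, 0.0]
final = simulate_cycle(quality, start, num_users=1000, visits_per_step=100, steps=10)
```

In this toy run the new page has the highest quality but still ends with the lowest popularity, because it starts at the bottom rank and receives the smallest share of visits. That is the entrenchment effect in miniature.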
Deriving Popularity Evolution Curve
[Figure: popularity P(p,t) plotted against time (t)]
Next step: derive formula for popularity evolution curve
Derive it using the awareness distribution of pages
Deriving Popularity Evolution Curve
Assumptions
– Number of pages is constant
– Pages are created and retired according to a Poisson process with rate parameter λ
– Quality distribution of pages is stationary
In the steady state, both the popularity and awareness distributions of the pages are stationary
Popularity Evolution Curve and Awareness Distribution
Popularity Evolution Curve E(x,q): time duration for which a page of quality q has popularity value x
Awareness distribution $f(a_i \mid q)$: fraction of pages of quality q whose awareness is a_i = i / (#users)
Next: derive popularity evolution curve using the awareness distribution
Popularity Evolution Curve and Awareness Distribution
$f(a_i \mid q)$: interpret it as the probability that a page of quality q has awareness a_i at any point in time
We know that:
$$P(p,t) = A(p,t) \cdot Q(p)$$
Hence,
$$E(a_i \cdot q,\, q) \propto f(a_i \mid q)$$
Deriving Awareness Distribution
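$f(a_i \mid q)$: fraction of pages of quality q whose awareness is a_i = i / (#users)
Doing the steady state analysis, we get
$$f(a_i \mid q) = \frac{\lambda}{\bigl(\lambda + F_{RV}(F_{PR}(0))\bigr)\,(1 - a_i)} \prod_{j=1}^{i} \frac{F_{RV}(F_{PR}(a_{j-1} \cdot q))}{\lambda + F_{RV}(F_{PR}(a_j \cdot q))}$$
where λ is the rate parameter of page creation and retirement
but remember that we do not know FPR yet
R(p,t) = FPR(P(p,t))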
Deriving Awareness Distribution
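$$f(a_i \mid q) = \frac{\lambda}{\bigl(\lambda + F_{RV}(F_{PR}(0))\bigr)\,(1 - a_i)} \prod_{j=1}^{i} \frac{F_{RV}(F_{PR}(a_{j-1} \cdot q))}{\lambda + F_{RV}(F_{PR}(a_j \cdot q))}$$
where λ is the rate parameter of page creation and retirement
Good news: since rank is a combination of popularity and randomness, we can derive FPR given $f(a_i \mid q)$ (example below)
$$F_{PR}(x) = 1 + \sum_{p \in P} \; \sum_{i = 1 + \lfloor m \cdot x / Q(p) \rfloor}^{m} f(a_i \mid Q(p))$$
where m is the number of users, so a_i = i/m
Start with an initial form of FPR; iterate till convergence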
Summary of Where We Stand
Formalized the popularity evolution cycle
– Relationship between popularity evolution and awareness distribution
– Derived the awareness distribution
Next step: tune parameters
Recall, goal is to obtain a scheme that:
1. achieves high QPC (quality per click)
2. is robust across a wide range of community types
Tuning the Promotion Scheme
Parameters: k, r and Wm
Objective: maximize QPC
Influential factors:
– Number of pages and users
– Page lifetimes
– Visits per user
Default Community Setting
Number of pages = 10,000 *
Number of users = 1000
Visits per user = 1000 visits per day
Page lifetimes = 1.5 years [Ntoulas et al., WWW’04]
* How Much Information? SIMS, Berkeley, 2003
Tuning: Wm parameter
[Figure: comparison of no promotion, uniform promotion, and selective promotion, with k=1 and r=0.2]
Tuning: k and r
• Optimal r lies in (0,1)
• Optimal r increases with increasing k
Based on simulation (reason: analysis only accurate for small values of r)
Tuning: k and r
Deciding k & r :
– k >= 2 for “feeling lucky”
– Minimize amount of “junk” perceived
– Maximize QPC
Final Parameter Settings
Promotion pool (Wm): zero-awareness pages
Start rank (k): 1 or 2
Randomization (r): 0.1
Tuning the Promotion Scheme
Parameters: k, r and Wm
Objective: maximize QPC
Influential factors:
– Number of pages and users
– Page lifetimes
– Visits per user
Influence of Number of Pages and Users
Influence of Page Lifetime and Visit rate
Influence of Visit Rate
1000 visits/day per user
Summary
Entrenchment effect hurts search result quality
Solution: Randomized rank promotion
Model of Web evolution and QPC metric
– Used to tune & evaluate randomized rank promotion
Initial results
– Significantly increases QPC
– Robust across wide range of Web communities
More study required
THE END
Paper available at: www.cs.cmu.edu/~spandey