Discovering Bucket Orders from Full Rankings
Jianlin Feng*
Department of Computer Science and Technology, Huazhong University of Science and Technology
Qiong Fang, Wilfred Ng
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
* Work done at UIUC
SIGMOD, 6/10/08
HUST & HKUST 2
Definitions of Rankings and Orders
Full ranking: a permutation of n items (or objects).
Example: a full ranking T of 6 items: a c b d f e
Formalized by a total order: a binary relation on items that satisfies antisymmetry, transitivity, and linearity.
Partial ranking: a full ranking of k nonempty buckets; items in the same bucket are tied.
Formalized by a bucket order: a total order of buckets (i.e., “ties”).
Example: a bucket order B: {a, b, c, d} {e, f}
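As a minimal sketch of these definitions, a full ranking can be held as a list and a bucket order as a list of sets. The helper name `is_linear_extension` is hypothetical (it does not come from the slides); it checks whether a full ranking is compatible with a bucket order, which is the relationship the later slides call a "linear extension".

```python
def is_linear_extension(total_order, bucket_order):
    """Return True if total_order respects every precedence in bucket_order."""
    # Map each item to the index of the bucket that holds it.
    bucket_of = {item: i for i, bucket in enumerate(bucket_order) for item in bucket}
    if sorted(total_order) != sorted(bucket_of):
        return False  # both orders must cover exactly the same items
    indices = [bucket_of[item] for item in total_order]
    # Bucket indices must never decrease along the total order.
    return all(x <= y for x, y in zip(indices, indices[1:]))


T = ["a", "c", "b", "d", "f", "e"]        # the full ranking T from the slide
B = [{"a", "b", "c", "d"}, {"e", "f"}]    # the bucket order B from the slide

print(is_linear_extension(T, B))                               # True
print(is_linear_extension(["a", "e", "b", "c", "d", "f"], B))  # False
```

Items inside one bucket may appear in any relative order, so many full rankings are linear extensions of the same bucket order.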

Introduction
Input: m full rankings (total orders) over n items.
Output: a single full ranking over n items. This is rank aggregation, used in voting, meta-search, and multi-criteria queries.
Or output: a single bucket order over n items? This is Bucket Order Discovering (BOD).

Motivation (1): Representing Collective Browsing Habits
Each user’s habit is reflected in his or her browsing sequence:
user 1: sports weather
user 2: politics weather
…
user n: politics weather
Similar users should have similar, but not strictly the same, browsing sequences.
A “representative” bucket order of collective browsing habits: {politics, sports} {weather}
[Slide overlay labels: frontpage, frontpage, news, {frontpage, news}]

Motivation (2): Approximating Bucket Order of Fossil Sites
Seriation in paleontology: given a 0-1 matrix, find an order of the rows such that the 1s are as consecutive as possible.
Markov Chain Monte Carlo (MCMC) methods produce total orders (Puolamäki et al, PLoS Comput Biol’2006).
The underlying order is indeed a bucket order.
Paleontological dataset g10s10: 124 fossil sites; the “ground truth” bucket order has 15 buckets.
Given the total orders generated by MCMC, which are linear extensions of the underlying bucket order, we want to find a good approximation of the underlying bucket order.
Example of seriation (axis labels on the slide: fossil site, species).
Before seriation:
      s1  s2  s3
f1     0   0   1
f2     1   1   0
f3     0   1   1
f4     0   1   0
f5     1   1   1
After seriation:
      s1  s2  s3
f1     0   0   1
f3     0   1   1
f5     1   1   1
f2     1   1   0
f4     0   1   0
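The seriation criterion can be checked mechanically. The sketch below is an illustration only (the function names are hypothetical): it tests whether, for a given row order, the 1s in every column form a single contiguous run, using the 5-by-3 matrix from the slide.

```python
def ones_are_consecutive(column):
    """True if all 1s in a 0-1 sequence form one contiguous run."""
    trimmed = "".join(str(v) for v in column).strip("0")
    return "0" not in trimmed


def is_seriated(matrix, row_order):
    """True if every column has consecutive 1s under the given row order."""
    rows = [matrix[r] for r in row_order]
    return all(ones_are_consecutive(col) for col in zip(*rows))


# Rows f1..f5 and columns s1..s3, copied from the slide.
M = {
    "f1": [0, 0, 1],
    "f2": [1, 1, 0],
    "f3": [0, 1, 1],
    "f4": [0, 1, 0],
    "f5": [1, 1, 1],
}

print(is_seriated(M, ["f1", "f2", "f3", "f4", "f5"]))  # False
print(is_seriated(M, ["f1", "f3", "f5", "f2", "f4"]))  # True
```

This only verifies a candidate order; finding a good order for noisy data is the harder problem that the MCMC approach addresses.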

Problem Statement: Bucket Order Discovering (BOD)
Given m full rankings R = {T1, T2, ..., Tm} over n items, we want to find a bucket order B such that:
“Representative” perspective: B is a good “representative” that summarizes R well.
“Approximation” perspective: B is a good “approximation” of some “ground truth” bucket order G, where R is simply a set of “linear extensions” of G.

Outline
Motivation
Problem formulation
Previous algorithms: the Bucket Pivot Algorithm, the Dynamic Programming Algorithm
Our approach: the Bucket Gap Algorithm
Experimental study
Conclusion

What Makes a Good Bucket Order?
Precedence probability Ptu: the fraction of the input full rankings in which item t precedes item u.
A good bucket order B should preserve the pairwise precedence relationships well:
small |Ptu - 1.0| ==> t should precede u in B.
small |Ptu - 0.5| ==> t and u should be “tied” in B.
small |Ptu - 0.0| ==> u should precede t in B.
The distance between B and the input full rankings is the sum, over item pairs, of whichever of |Ptu - 1.0|, |Ptu - 0.5|, or |Ptu - 0.0| matches the relationship between t and u in B.

Distance in Matrix Notation (Gionis et al, KDD’2006)
The input pair-order matrix C: Ctu is Ptu.
The pair-order matrix CB for bucket order B:
CBtu equals 1.0 if t precedes u in B
CBtu equals 0.0 if u precedes t in B
CBtu equals 0.5 if t and u are “tied” in B
The distance between B and the input full rankings:
$$d(C, C^B) = \sum_{t=1}^{n} \sum_{u=1}^{n} |C_{tu} - C^B_{tu}|$$
This is the I-Distance for goodness of “representative”.
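The following sketch shows one way to compute the pair-order matrix and the I-Distance. The function names and the two browsing sequences are illustrative assumptions, not data from the paper. Diagonal entries are set to 0.5, matching the diagonal shown in the matrix on the PIVOT slide.

```python
from itertools import permutations


def pair_order_matrix(rankings):
    """C[(t, u)] = fraction of rankings in which t precedes u; C[(t, t)] = 0.5."""
    items = sorted(rankings[0])
    wins = {(t, u): 0 for t, u in permutations(items, 2)}
    for ranking in rankings:
        pos = {item: i for i, item in enumerate(ranking)}
        for t, u in wins:
            if pos[t] < pos[u]:
                wins[(t, u)] += 1
    C = {pair: count / len(rankings) for pair, count in wins.items()}
    C.update({(t, t): 0.5 for t in items})
    return C


def bucket_matrix(bucket_order):
    """Pair-order matrix of a bucket order: 1.0 before, 0.0 after, 0.5 tied."""
    bucket_of = {item: i for i, b in enumerate(bucket_order) for item in b}
    CB = {}
    for t in bucket_of:
        for u in bucket_of:
            if bucket_of[t] == bucket_of[u]:
                CB[(t, u)] = 0.5
            else:
                CB[(t, u)] = 1.0 if bucket_of[t] < bucket_of[u] else 0.0
    return CB


def distance(C1, C2):
    """Sum of absolute entry-wise differences between two pair-order matrices."""
    return sum(abs(C1[pair] - C2[pair]) for pair in C1)


# Two hypothetical browsing sequences (illustrative input only).
rankings = [["politics", "sports", "weather"], ["sports", "politics", "weather"]]
C = pair_order_matrix(rankings)

tied = [{"politics", "sports"}, {"weather"}]
strict = [{"politics"}, {"sports"}, {"weather"}]
print(distance(C, bucket_matrix(tied)))    # 0.0
print(distance(C, bucket_matrix(strict)))  # 1.0
```

Because the two users disagree on politics versus sports, tying those items gives a smaller I-Distance than forcing either order. Passing the matrix of a ground-truth bucket order G as the first argument gives the G-Distance instead.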

G-Distance for Goodness of “Approximation” (Gionis et al, KDD’2006)
CG: the pair-order matrix of the “ground truth” G.
$$d(C^G, C^B) = \sum_{t=1}^{n} \sum_{u=1}^{n} |C^G_{tu} - C^B_{tu}|$$

Formal Definition of BOD
The BOD problem is now formulated as: given a collection of input full rankings, find a bucket order that minimizes I-Distance (or G-Distance).
This optimization problem is NP-hard (Gionis et al, KDD’2006), so we have to use heuristic algorithms.

The Bucket Pivot Algorithm (PIVOT) (Gionis et al, KDD’2006)
Input: the input pair-order matrix C
Output: a bucket order B
Idea:
If Ctu is close enough to 0.5, i.e., 0.5 - f ≤ Ctu < 0.5 + f, where f is the bounding parameter, then t and u should be put into the same bucket in B.
Else u goes “left” (u precedes t) or “right” (t precedes u).
To avoid checking each Ctu, proceed like the quick-sort algorithm.
Adapted from the FAS-PIVOT algorithm (Ailon et al, STOC’2005).

Limitations of PIVOT: Results Heavily Depend on the Pivots Chosen and on f
The input pair-order matrix C:
      a    b    c    d    e    f
a    0.5  0.8  0.6  0.6  0.8  0.6
b    0.2  0.5  0.2  0.4  0.8  0.6
c
d
e
f
With bounding parameter f = 0.25:
• If a is the pivot in the 1st recursion: {a, c, d, f} {b} {e}
• If b is the pivot in the 1st recursion: {a, c} {b, d, f} {e}
With bounding parameter f = 0.35:
• If a is the pivot in the 1st recursion: {a, b, c, d, e, f}
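The three outcomes above can be reproduced with a short quick-sort-style sketch. This is an illustration under stated assumptions, not the authors' implementation: the pivot is always the first item of the list passed in (so each run is reproducible), the parameter is named `bound` to avoid clashing with item f, and matrix entries that the slide does not show are derived from the assumption C[u][t] = 1 - C[t][u], which holds for full rankings without ties.

```python
def bucket_pivot(items, C, bound, choose=lambda xs: xs[0]):
    """Quick-sort-like sketch of PIVOT; returns a list of buckets (sets)."""
    if not items:
        return []
    pivot = choose(items)
    left, same, right = [], [pivot], []
    for u in items:
        if u == pivot:
            continue
        c = C(pivot, u)  # fraction of rankings in which pivot precedes u
        if 0.5 - bound <= c < 0.5 + bound:
            same.append(u)    # close to 0.5: tie with the pivot
        elif c >= 0.5 + bound:
            right.append(u)   # pivot precedes u
        else:
            left.append(u)    # u precedes pivot
    return (bucket_pivot(left, C, bound, choose)
            + [set(same)]
            + bucket_pivot(right, C, bound, choose))


# Rows a and b of the slide's matrix (off-diagonal entries only).
KNOWN = {
    ("a", "b"): 0.8, ("a", "c"): 0.6, ("a", "d"): 0.6, ("a", "e"): 0.8, ("a", "f"): 0.6,
    ("b", "a"): 0.2, ("b", "c"): 0.2, ("b", "d"): 0.4, ("b", "e"): 0.8, ("b", "f"): 0.6,
}


def C(t, u):
    # Assumption: entries not on the slide follow C[u][t] = 1 - C[t][u].
    return KNOWN[(t, u)] if (t, u) in KNOWN else 1.0 - KNOWN[(u, t)]


pivot_a = ["a", "b", "c", "d", "e", "f"]   # first item is the first pivot
pivot_b = ["b", "a", "c", "d", "e", "f"]

print([sorted(b) for b in bucket_pivot(pivot_a, C, 0.25)])
# [['a', 'c', 'd', 'f'], ['b'], ['e']]
print([sorted(b) for b in bucket_pivot(pivot_b, C, 0.25)])
# [['a', 'c'], ['b', 'd', 'f'], ['e']]
print([sorted(b) for b in bucket_pivot(pivot_a, C, 0.35)])
# [['a', 'b', 'c', 'd', 'e', 'f']]
```

Changing only the first pivot or only the bound changes the resulting bucket order, which is the sensitivity this slide points out.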

The Dynamic Programming Algorithm (DP) (Fagin et al, PODS’2004)
Idea: if two items’ median ranks are close enough, they should be put into the same bucket.
Median_rank(i) = median(T1(i), T2(i), …, Tm(i))
Step 1: pre-processing, to avoid checking “closeness” on median rank between each pair of items.
MEDRANK (Fagin et al, SIGMOD’2003) sorts the n items into a total order T in non-decreasing order of the items’ median ranks.
T: <a: 1, c: 2, b: 4, d: 4, f: 5, e: 6>
Step 2: use “closeness” on median rank to form buckets, using dynamic programming to segment T into a bucket order B.

Two Limitations of DP from the “Approximation” Perspective
Limitation 1: two items from different buckets in the “ground truth” bucket order G can also have close median ranks.
Limitation 2: DP’s minimization of bucket costs tends to break a big bucket b of G into several small buckets.
Bucket cost: [formula not preserved; its annotations read “median rank of the l-th item along a total order T” and “average position”]
Observed on g10s10: DP generates 34 buckets, while G has only 15 buckets.

The Bucket Gap Algorithm (GAP): Basic Ideas
Motivated by the two limitations of DP.
Idea 1: if two items are close on multiple quantile ranks, it is more reliable to put them into the same bucket.
Quantile_rank(i) = quantile(T1(i), T2(i), …, Tm(i))
The median rank is the quantile rank w.r.t. the 50% quantile.
Idea 2: items from different buckets should have “abnormally large gaps” between their quantile ranks.
In contrast, DP’s idea is that items in the same bucket should have small gaps between their median ranks.

The Bucket Gap Algorithm: A Two-Phase Framework
Phase 1: check “closeness” of items on each quantile rank separately.
For each quantile, sort all the items in non-decreasing order of their corresponding quantile ranks. Such a total order is called a quantile order.
Use our novel Abnormal Rank Gap heuristic to segment the quantile orders into initial bucket orders.
Phase 2: aggregate the “closeness” of items on each quantile rank to generate the final bucket order.
Perform a median rank aggregation on the initial bucket orders.

MEDRANK+: Generating Quantile Orders
First, sort the quantiles in increasing order.
Then, perform a round-robin scan of all the input full rankings.
In each round, output items with their quantile ranks to the corresponding quantile orders.
Input full rankings:
T1: a b c d e f
T2: a c b d f e
T3: a c d b f e
T4: d f c a b e
T5: c f e d b a
Quantiles and quantile orders:
30%  Q1: <a: 1, c: 2, f: 2, b: 3, d: 3, e: 5>
50%  Q2: <a: 1, c: 2, b: 4, d: 4, f: 5, e: 6>
70%  Q3: <c: 3, a: 4, d: 4, b: 5, f: 5, e: 6>
90%  Q4: <c: 3, d: 4, b: 5, a: 6, e: 6, f: 6>
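The quantile orders on this slide can be recomputed directly. The sketch below makes three assumptions that should be read as such. First, the five input rankings are a reconstruction in which the first item of each ranking (a, a, a, d, c) is restored at the front. Second, the quantile rank of an item for quantile q is taken to be its k-th smallest position with k = ceil(q * m). Third, ties are broken alphabetically. It sorts positions directly rather than performing the round-robin scan that MEDRANK+ uses, so it illustrates the output, not the access pattern.

```python
def quantile_orders(rankings, percents):
    """Map each quantile (in percent) to a list of (item, quantile rank) pairs."""
    m = len(rankings)
    positions = {}  # item -> list of its 1-based positions across the rankings
    for ranking in rankings:
        for pos, item in enumerate(ranking, start=1):
            positions.setdefault(item, []).append(pos)
    orders = {}
    for pct in percents:
        k = (pct * m + 99) // 100  # integer form of ceil(pct / 100 * m)
        q_rank = {item: sorted(p)[k - 1] for item, p in positions.items()}
        # Non-decreasing quantile rank; alphabetical tie-break (an assumption).
        orders[pct] = sorted(q_rank.items(), key=lambda kv: (kv[1], kv[0]))
    return orders


# Reconstructed input rankings (assumption, see the lead-in).
rankings = [
    list("abcdef"),
    list("acbdfe"),
    list("acdbfe"),
    list("dfcabe"),
    list("cfedba"),
]

orders = quantile_orders(rankings, [30, 50, 70, 90])
for pct, order in orders.items():
    print(pct, order)
# 30 [('a', 1), ('c', 2), ('f', 2), ('b', 3), ('d', 3), ('e', 5)]
# 50 [('a', 1), ('c', 2), ('b', 4), ('d', 4), ('f', 5), ('e', 6)]
# 70 [('c', 3), ('a', 4), ('d', 4), ('b', 5), ('f', 5), ('e', 6)]
# 90 [('c', 3), ('d', 4), ('b', 5), ('a', 6), ('e', 6), ('f', 6)]
```

The 50% row is the median-rank order that MEDRANK alone would produce, which is why the median rank is described as one particular quantile rank.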

The Abnormal Rank Gap Heuristic
A quantile order Q1: <a: 1, c: 2, f: 2, b: 3, d: 3, e: 5>
5 rank gaps: g1 = 1, g2 = 0, g3 = 1, g4 = 0, g5 = 2
Average gap ga and standard deviation sg: ga = 4/5, sg = sqrt(14) / 5.
A rank gap gi is abnormal if gi > average gap + one unit of standard deviation.
The heuristic: an abnormal rank gap separates two consecutive buckets, so Na abnormal rank gaps yield (Na + 1) buckets.
Only g5 is abnormal in Q1.
Initial bucket order B1: <{a, b, c, d, f}, {e}>
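The arithmetic of this example can be checked with a few lines. The sketch below is a minimal illustration (the function name is hypothetical); it uses the population standard deviation, since that is what yields sg = sqrt(14) / 5 for the gaps 1, 0, 1, 0, 2.

```python
from statistics import mean, pstdev


def segment_by_abnormal_gaps(quantile_order):
    """Split a quantile order into buckets at every abnormal rank gap."""
    if len(quantile_order) < 2:
        return [], [{item for item, _ in quantile_order}]
    ranks = [rank for _, rank in quantile_order]
    gaps = [b - a for a, b in zip(ranks, ranks[1:])]
    threshold = mean(gaps) + pstdev(gaps)  # average gap + one standard deviation
    buckets = [[quantile_order[0][0]]]
    for (item, _), gap in zip(quantile_order[1:], gaps):
        if gap > threshold:
            buckets.append([])  # an abnormal gap starts a new bucket
        buckets[-1].append(item)
    return gaps, [set(b) for b in buckets]


Q1 = [("a", 1), ("c", 2), ("f", 2), ("b", 3), ("d", 3), ("e", 5)]
gaps, B1 = segment_by_abnormal_gaps(Q1)
print(gaps)                     # [1, 0, 1, 0, 2]
print([sorted(b) for b in B1])  # [['a', 'b', 'c', 'd', 'f'], ['e']]
```

Here the threshold is 4/5 + sqrt(14)/5, roughly 1.55, so only the last gap of 2 exceeds it and the order splits into two buckets.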

Median Rank Aggregation on Initial Bucket Orders
Put items with the same median rank into the same bucket in the final bucket order.
Quantile  Initial bucket order
30%       B1: <{a, b, c, d, f}, {e}>
50%       B2: <{a, c}, {b, d, e, f}>
70%       B3: <{a, b, c, d, f}, {e}>
90%       B4: <{b, c, d}, {a, e, f}>
Final bucket order B: <{a, b, c, d}, {e, f}>
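The aggregation step can be sketched as follows. Two choices here are assumptions rather than details stated on the slide: an item's rank in an initial bucket order is taken to be the index of its bucket, and with an even number of orders the upper median (`statistics.median_high`) is used. These choices reproduce the final bucket order shown above; the paper's exact definition may differ.

```python
from statistics import median_high


def aggregate(initial_bucket_orders):
    """Group items by the median of their bucket indices across the orders."""
    ranks = {}  # item -> list of 1-based bucket indices, one per order
    for order in initial_bucket_orders:
        for index, bucket in enumerate(order, start=1):
            for item in bucket:
                ranks.setdefault(item, []).append(index)
    final = {}  # median rank -> set of items sharing it
    for item, item_ranks in ranks.items():
        final.setdefault(median_high(item_ranks), set()).add(item)
    return [final[r] for r in sorted(final)]


# Initial bucket orders B1..B4 copied from the slide.
initial = [
    [{"a", "b", "c", "d", "f"}, {"e"}],
    [{"a", "c"}, {"b", "d", "e", "f"}],
    [{"a", "b", "c", "d", "f"}, {"e"}],
    [{"b", "c", "d"}, {"a", "e", "f"}],
]

B = aggregate(initial)
print([sorted(b) for b in B])  # [['a', 'b', 'c', 'd'], ['e', 'f']]
```

Item f shows why the median convention matters: its bucket indices are 1, 2, 1, 2, so the upper median places it with e, while a plain average of the two middle values would put it in a bucket of its own.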

Experimental Study
Algorithms: PIVOT, DP, GAP.
Only PIVOT has error bars, showing one unit of standard deviation.
Datasets:
Synthetic datasets (noise level: 20%)
Real clickstream dataset: MSNBC
Real paleontology dataset: g10s10 (2,000 sequences, 124 items)
Details of the results are in the paper.

Scalability using G-Distance: Synthetic Dataset
[Figure: Time (s) versus Number of Items; series GAP-1K, GAP-2K, PIVOT-1K, PIVOT-2K, DP-1K, DP-2K.]
The bottleneck of PIVOT (or of using I-Distance): computing the input pair-order matrix costs O(mn^2), where m is the number of input full rankings and n is the number of items.

I-Distance and G-Distance: Paleontological Data (2,000 sequences, 124 items)
[Two plots: Distance versus Number of Quantiles (1 to 100); series GAP, PIVOT, DP; annotation: “GAP using Median Rank only”.]
The adoption of multiple quantile ranks makes sense.
Since GAP is fast, we can run it several times to search for the best result.

Conclusions
Introduced a two-phase rank aggregation framework that exploits “closeness” on multiple quantile ranks; it can achieve more reliable bucket forming.
Introduced the Abnormal Rank Gap heuristic; it can better check “closeness” on a single quantile rank and avoids breaking big buckets into small ones.
Future work:
The general setting: input full rankings have various lengths.
Some theoretical basis: gain better insight into GAP’s effectiveness.

A Note on Correction of Reference 7
Two authors (Sivakumar, D., and Vee, E.) were left out of Reference 7.
The correct version is: Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing and Aggregating Rankings with Ties. ACM PODS, 2004, pp. 47–58.

Thank you
Any questions?

References
[Ailon, STOC’2005] Ailon, N., Charikar, M., and Newman, A. Aggregating Inconsistent Information: Ranking and Clustering. ACM STOC, 2005, pp. 684–693.
[Fagin, PODS’2004] Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing and Aggregating Rankings with Ties. ACM PODS, 2004, pp. 47–58.
[Fagin, SIGMOD’2003] Fagin, R., Kumar, R., and Sivakumar, D. Efficient Similarity Search and Classification via Rank Aggregation. ACM SIGMOD, 2003, pp. 301–312.
[Gionis, KDD’2006] Gionis, A., Mannila, H., Puolamäki, K., and Ukkonen, A. Algorithms for Discovering Bucket Orders from Data. ACM KDD, 2006, pp. 561–566.
[Puolamäki, PLoS Comput Biol’2006] Puolamäki, K., Fortelius, M., and Mannila, H. Seriation in Paleontological Data Using Markov Chain Monte Carlo Methods. PLoS Comput Biol 2(2): e6, 2006.