
Discovering Bucket Orders from Full Rankings

Jianlin Feng*
Department of Computer Science and Technology
Huazhong University of Science and Technology

Qiong Fang, Wilfred Ng
Department of Computer Science and Engineering
Hong Kong University of Science and Technology

* Work done at UIUC

SIGMOD, 6/10/08

HUST & HKUST 2

Definitions of Rankings and Orders

Full ranking: a permutation of n items (or objects).

Example: a full ranking T of 6 items: a c b d f e

Formalized by a total order: a binary relation on the items that satisfies the three criteria of antisymmetry, transitivity, and linearity.

Partial ranking: a full ranking of k nonempty buckets. Items in the same bucket are tied.

Formalized by a bucket order: a total order of buckets (i.e., “ties”).

Example: a bucket order B: {a, b, c, d} {e, f}


Introduction

Input: m full rankings (total orders) over n items.

Output: a single full ranking over n items. This is rank aggregation, used in voting, meta-search, and multi-criteria queries.

Or output: a single bucket order over n items? This is Bucket Order Discovering (BOD).


Motivation (1): Representing Collective Browsing Habits

Each user’s habit is reflected in his or her browsing sequence:

user 1: frontpage sports weather
user 2: frontpage politics weather
…
user n: news politics weather

Similar users should have similar, but not strictly identical, browsing sequences.

A “representative” bucket order of collective browsing habits: {frontpage, news} {politics, sports} {weather}


Motivation (2): Approximating the Bucket Order of Fossil Sites

Seriation in paleontology: given a 0-1 matrix, find an order of the rows such that the 1s are as consecutive as possible.

Markov Chain Monte Carlo (MCMC) sampling generates total orders (Puolamäki et al, PLoS Comput Biol’2006).

The underlying order is in fact a bucket order.

Paleontological dataset g10s10: 124 fossil sites; the “ground truth” bucket order has 15 buckets.

Given the total orders generated by MCMC, which are linear extensions of the underlying bucket order, we want to find a good approximation of the underlying bucket order.

Example of seriation on a 0-1 matrix (axis labels: fossil site, species).

Before seriation:

      s1  s2  s3
f1     0   0   1
f2     1   1   0
f3     0   1   1
f4     0   1   0
f5     1   1   1

After seriation (rows reordered):

      s1  s2  s3
f1     0   0   1
f3     0   1   1
f5     1   1   1
f2     1   1   0
f4     0   1   0


Problem Statement: Bucket Order Discovering (BOD)

Given m full rankings R = {T1, T2, ..., Tm} over n items, we want to find a bucket order B such that:

“Representative” perspective: B is a good “representative” that summarizes R well.

“Approximation” perspective: B is a good “approximation” of some “ground truth” bucket order G, where R is simply a set of “linear extensions” of G.


Outline

Motivation

Problem formulation

Previous algorithms: the Bucket Pivot Algorithm and the Dynamic Programming Algorithm

Our approach: the Bucket Gap Algorithm

Experimental study

Conclusion


What Makes a Good Bucket Order?

Precedence probability Ptu: the fraction of the input full rankings in which item t precedes item u.

A good bucket order B should preserve the pairwise precedence relationships well:
small |Ptu - 1.0| ==> t should precede u in B.
small |Ptu - 0.5| ==> t and u should be “tied” in B.
small |Ptu - 0.0| ==> u should precede t in B.

The distance between B and the input full rankings: the sum, over item pairs (t, u), of |Ptu - 1.0|, |Ptu - 0.5|, or |Ptu - 0.0|, according to whether B places t before u, ties them, or places u before t.
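As a minimal sketch of how the precedence probabilities can be computed, the Python below counts, for every ordered pair (t, u), how many rankings place t before u and divides by the number of rankings. The function name `pair_order_matrix`, the 0.5 value on the diagonal, and the four toy rankings are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import permutations

def pair_order_matrix(rankings):
    """Return P with P[t][u] = fraction of rankings in which t precedes u."""
    items = sorted(rankings[0])
    m = len(rankings)
    counts = {t: {u: 0 for u in items} for t in items}
    for ranking in rankings:                      # m rankings
        pos = {item: i for i, item in enumerate(ranking)}
        for t, u in permutations(items, 2):       # n*(n-1) ordered pairs
            if pos[t] < pos[u]:
                counts[t][u] += 1
    # Diagonal fixed at 0.5 (assumption): an item is treated as tied with itself.
    return {t: {u: 0.5 if t == u else counts[t][u] / m for u in items}
            for t in items}

# Toy input (hypothetical): four full rankings over three items.
rankings = [list("abc"), list("acb"), list("bac"), list("abc")]
P = pair_order_matrix(rankings)
print(P["a"]["b"], P["a"]["c"], P["b"]["c"])
```

In this toy input a precedes b in three of four rankings, so Pab is close to 1.0 and a good bucket order would place a before b. The nested loops also show why building the full matrix takes time proportional to m times n squared.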


Distance in Matrix Notation (Gionis et al, KDD’2006)

The input pair-order matrix C: Ctu is Ptu.

The pair-order matrix CB for bucket order B:

CBtu equals 1.0 if t precedes u in B

CBtu equals 0.0 if u precedes t in B

CBtu equals 0.5 if t and u are “tied” in B

The distance between B and the input full rankings:

$d(C, C^B) = \sum_{t=1}^{n} \sum_{u=1}^{n} |C_{tu} - C^B_{tu}|$

This is the I-Distance, which measures the goodness of a “representative”.
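The distance can be sketched in a few lines of Python: build the pair-order matrix of a candidate bucket order, then sum the absolute entry-wise differences against C. The helper names `bucket_matrix` and `l1_distance` and the 3-item matrix `C` are hypothetical, chosen only for illustration.

```python
def bucket_matrix(buckets):
    """Pair-order matrix of a bucket order given as a list of buckets."""
    index = {item: k for k, bucket in enumerate(buckets) for item in bucket}
    CB = {}
    for t in index:
        CB[t] = {}
        for u in index:
            if index[t] < index[u]:
                CB[t][u] = 1.0      # t precedes u
            elif index[t] > index[u]:
                CB[t][u] = 0.0      # u precedes t
            else:
                CB[t][u] = 0.5      # tied (same bucket, or t == u)
    return CB

def l1_distance(C1, C2):
    """Sum of |C1[t][u] - C2[t][u]| over all ordered pairs (t, u)."""
    return sum(abs(C1[t][u] - C2[t][u]) for t in C1 for u in C1[t])

# Hypothetical input pair-order matrix over three items.
C = {
    "a": {"a": 0.5, "b": 0.75, "c": 1.0},
    "b": {"a": 0.25, "b": 0.5, "c": 0.75},
    "c": {"a": 0.0, "b": 0.25, "c": 0.5},
}
d_two_buckets = l1_distance(C, bucket_matrix([["a", "b"], ["c"]]))
d_one_bucket = l1_distance(C, bucket_matrix([["a", "b", "c"]]))
print(d_two_buckets, d_one_bucket)
```

For this toy matrix, the bucket order {a, b} {c} is closer to C than the single bucket {a, b, c}, so it would be the better “representative” of the two.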


G-Distance for Goodness of “Approximation” (Gionis et al, KDD’2006)

CG: the pair-order matrix of the “ground truth” G.

$d(C^G, C^B) = \sum_{t=1}^{n} \sum_{u=1}^{n} |C^G_{tu} - C^B_{tu}|$


Formal Definition of BOD

The BOD problem is now formulated as: given a collection of input full rankings, find a bucket order that minimizes I-Distance (or G-Distance).

This optimization problem is NP-hard (Gionis et al, KDD’2006), so we have to use heuristic algorithms.


The Bucket Pivot Algorithm (PIVOT) (Gionis et al, KDD’2006)

Input: the input pair-order matrix C

Output: a bucket order B

Idea:

If Ctu is close enough to 0.5, that is, 0.5 - f ≤ Ctu < 0.5 + f, where f is a bounding parameter, then t and u should be put into the same bucket in B.

Else u goes to the “left” (u precedes t) or to the “right” (t precedes u).

To avoid checking every Ctu, proceed like the quicksort algorithm: compare the other items only against a pivot item, then recurse on the “left” and “right” parts. Adapted from the FAS-PIVOT algorithm (Ailon et al, STOC’2005).
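The recursion can be sketched as follows. This is an illustrative reading of the idea above, not the authors' implementation: the function name `bucket_pivot` and the `choose_pivot` parameter are assumptions, and the matrix holds only the a and b rows of the example pair-order matrix from the PIVOT limitations slide, which is all the two runs below need.

```python
import random

def bucket_pivot(items, C, f, choose_pivot=random.choice):
    """Quicksort-like bucket construction around a pivot item."""
    if not items:
        return []
    p = choose_pivot(items)
    left, same, right = [], [p], []
    for u in items:
        if u == p:
            continue
        c = C[p][u]                       # fraction of rankings with p before u
        if 0.5 - f <= c < 0.5 + f:
            same.append(u)                # close to 0.5: tie u with the pivot
        elif c < 0.5 - f:
            left.append(u)                # u mostly precedes the pivot
        else:
            right.append(u)               # the pivot mostly precedes u
    return (bucket_pivot(left, C, f, choose_pivot)
            + [same]
            + bucket_pivot(right, C, f, choose_pivot))

# Rows a and b of the example matrix; other rows are not needed for these runs.
C = {
    "a": {"a": 0.5, "b": 0.8, "c": 0.6, "d": 0.6, "e": 0.8, "f": 0.6},
    "b": {"a": 0.2, "b": 0.5, "c": 0.2, "d": 0.4, "e": 0.8, "f": 0.6},
}
items = list("abcdef")
first_item = lambda xs: xs[0]                           # deterministic pivot rule
b_first = lambda xs: "b" if "b" in xs else xs[0]        # force b as first pivot
with_pivot_a = bucket_pivot(items, C, 0.25, first_item)
with_pivot_b = bucket_pivot(items, C, 0.25, b_first)
print(with_pivot_a, with_pivot_b)
```

Passing the pivot rule as a parameter makes the sensitivity to pivot choice easy to observe: the two runs use the same matrix and the same f but return different bucket orders.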


Limitations of PIVOT: Results Depend Heavily on the Pivots Chosen and on f

The input pair-order matrix C:

      a    b    c    d    e    f
a    0.5  0.8  0.6  0.6  0.8  0.6
b    0.2  0.5  0.2  0.4  0.8  0.6
c
d
e
f

With f = 0.25:

• If a is the pivot in the 1st recursion: {a, c, d, f} {b} {e}

• If b is the pivot in the 1st recursion: {a, c} {b, d, f} {e}

With f = 0.35:

• If a is the pivot in the 1st recursion: {a, b, c, d, e, f}


The Dynamic Programming Algorithm (DP) (Fagin et al, PODS’2004)

Idea: if two items’ median ranks are close enough, they should be put into the same bucket.

Median_rank(i) = median(T1(i), T2(i), …, Tm(i))

Step 1: pre-processing, to avoid checking “closeness” of median ranks between every pair of items. MEDRANK (Fagin et al, SIGMOD’2003) sorts the n items into a total order T in non-decreasing order of the items’ median ranks:

T: <a: 1, c: 2, b: 4, d: 4, f: 5, e: 6>

Step 2: use “closeness” of median ranks to form buckets, applying dynamic programming to segment T into a bucket order B.
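Step 1 can be illustrated with a direct computation of median ranks. This sketch sorts each item's ranks instead of using MEDRANK's own scanning procedure, so it shows the result rather than the algorithm. The five rankings are an assumption: a hypothetical input chosen because it reproduces the median ranks quoted for T (a: 1, c: 2, b: 4, d: 4, f: 5, e: 6). Ties are broken alphabetically, which is also an assumption.

```python
from statistics import median

def median_ranks(rankings):
    """Median of each item's 1-based positions across the rankings."""
    items = sorted(rankings[0])
    return {i: median(r.index(i) + 1 for r in rankings) for i in items}

def medrank_order(rankings):
    """Items in non-decreasing order of median rank (ties: alphabetical)."""
    ranks = median_ranks(rankings)
    return sorted(ranks, key=lambda i: (ranks[i], i))

# Hypothetical input: five full rankings over six items.
rankings = [list("abcdef"), list("acbdfe"), list("acdbfe"),
            list("dfcabe"), list("cfedba")]
ranks = median_ranks(rankings)
T = medrank_order(rankings)
print([(i, ranks[i]) for i in T])
```

With an odd number of rankings the median is always an actual rank. For an even number, `statistics.median` averages the two middle values, which may or may not match the convention used in the paper.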


Two Limitations of DP from the “Approximation” Perspective

Limitation 1: two items from different buckets in the “ground truth” bucket order G can also have close median ranks.

Limitation 2: DP’s minimization of bucket costs tends to break a big bucket b of G into several small buckets.

Bucket cost: expressed in terms of the median rank of the l-th item along a total order T and the average position.

Observed on g10s10: DP generates 34 buckets, while G has only 15 buckets.
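To make the segmentation step concrete, here is a small dynamic-programming sketch. The cost function is an assumption: the exact bucket cost is not given here, so the code takes the cost of a bucket covering positions i..j of T to be the sum of |median rank of the l-th item - (i + j) / 2|, that is, the deviation of each median rank from the bucket's average position. The function name `dp_segment` is hypothetical.

```python
def dp_segment(median_rank_seq):
    """Split a sequence of median ranks (listed along T) into buckets.

    Returns (total_cost, segments), where each segment is a half-open
    (start, end) pair of 0-based indices into the sequence.
    """
    n = len(median_rank_seq)

    def cost(i, j):
        # Assumed bucket cost for 0-based positions i..j inclusive.
        avg_position = ((i + 1) + (j + 1)) / 2
        return sum(abs(median_rank_seq[l] - avg_position)
                   for l in range(i, j + 1))

    best = [0.0] + [float("inf")] * n   # best[k]: min cost for the first k items
    cut = [0] * (n + 1)                 # cut[k]: start of the last bucket
    for k in range(1, n + 1):
        for i in range(k):
            c = best[i] + cost(i, k - 1)
            if c < best[k]:
                best[k], cut[k] = c, i

    segments, k = [], n
    while k > 0:                        # walk the cut points backwards
        segments.append((cut[k], k))
        k = cut[k]
    segments.reverse()
    return best[n], segments

# Median ranks along T: <a: 1, c: 2, b: 4, d: 4, f: 5, e: 6>
total, segments = dp_segment([1, 2, 4, 4, 5, 6])
print(total, segments)
```

Under this assumed cost, a singleton bucket whose item's median rank equals its position costs nothing, which illustrates why a cost-minimizing segmentation can favour many small buckets.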


The Bucket Gap Algorithm (GAP): Basic Ideas

Motivated by the two limitations of DP.

Idea 1: if two items are close on multiple quantile ranks, it is more reliable to put them into the same bucket.

Quantile_rank(i) = quantile(T1(i), T2(i), …, Tm(i)). The median rank is the quantile rank w.r.t. the 50% quantile.

Idea 2: items from different buckets should have “abnormally large gaps” between their quantile ranks.

DP’s idea, by contrast: items in the same bucket should have small gaps between their median ranks.


The Bucket Gap Algorithm: A Two-Phase Framework

Phase 1: check “closeness” of items on each quantile rank separately.

For each quantile, sort all the items in non-decreasing order of their corresponding quantile ranks. Such a total order is called a quantile order.

Use our novel Abnormal Rank Gap heuristic to segment the quantile orders into initial bucket orders.

Phase 2: aggregate the “closeness” of items on each quantile rank to generate the final bucket order.

Perform a median rank aggregation on the initial bucket orders.


MEDRANK+: Generating Quantile Orders

First, sort the quantiles in increasing order: 30%, 50%, 70%, 90%.

Then, perform a round-robin scan of all the input full rankings. In each round, output items with their quantile ranks to the corresponding quantile orders.

Input full rankings:

T1: a b c d e f

T2: a c b d f e

T3: a c d b f e

T4: d f c a b e

T5: c f e d b a

Quantile   Quantile order

30%        Q1: <a: 1, c: 2, f: 2, b: 3, d: 3, e: 5>

50%        Q2: <a: 1, c: 2, b: 4, d: 4, f: 5, e: 6>

70%        Q3: <c: 3, a: 4, d: 4, b: 5, f: 5, e: 6>

90%        Q4: <c: 3, d: 4, b: 5, a: 6, e: 6, f: 6>
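The quantile orders can be reproduced with a direct computation. This sketch sorts each item's ranks rather than performing the round-robin scan, so it illustrates the output of MEDRANK+ and not its scanning strategy. Two things are assumptions: the quantile rank is taken to be the k-th smallest rank with k = ceil(q × m), and the five rankings complete the example with first items a, a, a, d, c, a completion chosen because it reproduces all four quantile orders shown. The function name `quantile_order` is hypothetical.

```python
def quantile_order(rankings, percent):
    """Items with their quantile ranks, sorted by rank (ties: alphabetical)."""
    m = len(rankings)
    # k = ceil(percent / 100 * m), in integer arithmetic; at least 1.
    k = max(1, (percent * m + 99) // 100)
    q_rank = {}
    for item in sorted(rankings[0]):
        ranks = sorted(r.index(item) + 1 for r in rankings)
        q_rank[item] = ranks[k - 1]          # k-th smallest rank of the item
    return sorted(q_rank.items(), key=lambda pair: (pair[1], pair[0]))

# Assumed completion of the five example rankings.
rankings = [list("abcdef"), list("acbdfe"), list("acdbfe"),
            list("dfcabe"), list("cfedba")]
orders = {q: quantile_order(rankings, q) for q in (30, 50, 70, 90)}
for q, order in orders.items():
    print(f"{q}%:", order)
```

With five rankings, the 30%, 50%, 70%, and 90% quantiles select the 2nd, 3rd, 4th, and 5th smallest rank of each item, so the 50% order coincides with the median-rank order.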


The Abnormal Rank Gap Heuristic

A quantile order Q1: <a: 1, c: 2, f: 2, b: 3, d: 3, e: 5>

5 rank gaps: g1 = 1, g2 = 0, g3 = 1, g4 = 0, g5 = 2

Average gap ga and standard deviation sg: ga = 4/5, sg = sqrt(14) / 5.

A rank gap gi is abnormal if gi > average gap + one unit of standard deviation.

The heuristic: an abnormal rank gap separates two consecutive buckets, so Na abnormal rank gaps yield (Na + 1) buckets.

Only g5 is abnormal in Q1.

Initial bucket order B1: < { a, b, c, d, f }, { e } >
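The worked example for Q1 can be checked with a short sketch of the threshold rule exactly as stated: a gap is abnormal when it exceeds the mean gap plus one standard deviation. The population standard deviation is used because it matches sg = sqrt(14) / 5 for these five gaps. The function name `segment_by_abnormal_gaps` is hypothetical, and the paper's full heuristic may include details not captured here.

```python
from statistics import mean, pstdev

def segment_by_abnormal_gaps(quantile_order):
    """Cut a non-empty quantile order into buckets at abnormally large gaps."""
    items = [item for item, _ in quantile_order]
    ranks = [rank for _, rank in quantile_order]
    gaps = [b - a for a, b in zip(ranks, ranks[1:])]
    if not gaps:                        # a single item forms a single bucket
        return [items]
    threshold = mean(gaps) + pstdev(gaps)
    buckets, current = [], [items[0]]
    for item, gap in zip(items[1:], gaps):
        if gap > threshold:             # abnormal gap: start a new bucket
            buckets.append(current)
            current = []
        current.append(item)
    buckets.append(current)
    return buckets

Q1 = [("a", 1), ("c", 2), ("f", 2), ("b", 3), ("d", 3), ("e", 5)]
B1 = segment_by_abnormal_gaps(Q1)
print(B1)
```

For Q1 the threshold is about 1.55, so only the final gap of 2 qualifies and e is split off into its own bucket.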


Median Rank Aggregation on Initial Bucket Orders

Put items with the same median rank into the same bucket in the final bucket order.

Quantile   Initial bucket order

30%        B1: < { a, b, c, d, f }, { e } >

50%        B2: < { a, c }, { b, d, e, f } >

70%        B3: < { a, b, c, d, f }, { e } >

90%        B4: < { b, c, d }, { a, e, f } >

Final bucket order B: < { a, b, c, d }, { e, f } >
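A possible reading of this aggregation step is sketched below. Two choices are assumptions made because they reproduce the final bucket order of the example, and the paper may define them differently: an item's rank in a bucket order is the 1-based index of its bucket, and with an even number of bucket orders the upper of the two middle values is taken as the median. The function name `aggregate_bucket_orders` is hypothetical.

```python
from statistics import median_high

def aggregate_bucket_orders(bucket_orders):
    """Group items by the median of their bucket indices across the orders."""
    items = sorted(i for bucket in bucket_orders[0] for i in bucket)

    def bucket_index(order, item):
        # 1-based index of the bucket that contains the item (assumed rank).
        return next(k for k, b in enumerate(order, start=1) if item in b)

    med = {i: median_high(bucket_index(o, i) for o in bucket_orders)
           for i in items}
    groups = {}
    for i in items:                     # items sharing a median rank are tied
        groups.setdefault(med[i], []).append(i)
    return [groups[r] for r in sorted(groups)]

initial = [
    [["a", "b", "c", "d", "f"], ["e"]],      # B1 (30%)
    [["a", "c"], ["b", "d", "e", "f"]],      # B2 (50%)
    [["a", "b", "c", "d", "f"], ["e"]],      # B3 (70%)
    [["b", "c", "d"], ["a", "e", "f"]],      # B4 (90%)
]
final = aggregate_bucket_orders(initial)
print(final)
```

Item f is the borderline case: it sits in the first bucket in two of the initial orders and in the second bucket in the other two, so its placement depends entirely on the tie-breaking rule for the median.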


Experimental study

Algorithms: PIVOT, DP, GAP

Only PIVOT has error bars, showing one unit of standard deviation.

Datasets:

Synthetic datasets. Noise level: 20%.

Real clickstream dataset: MSNBC.

Real paleontology dataset: g10s10. 2,000 sequences, 124 items.

Details of the results are in the paper.


Scalability Using G-Distance: Synthetic Dataset

[Figure: Time (s) versus Number of Items (500, 1000, 2000, 4000); series: GAP-1K, GAP-2K, PIVOT-1K, PIVOT-2K, DP-1K, DP-2K.]

The bottleneck of PIVOT (or of using I-Distance): computing the input pair-order matrix costs O(mn^2), where m is the number of input full rankings and n is the number of items.


I-Distance and G-Distance: Paleontological Data (2,000 sequences, 124 items)

[Figure: two panels plotting Distance versus Number of Quantiles (1 to 100); series: GAP, PIVOT, DP; annotation: GAP using Median Rank only.]

The adoption of multiple quantile ranks makes sense. Since GAP is fast, we can run it several times to search for the best result.


Conclusions

Introduce a two-phase rank aggregation framework to exploit “closeness” on multiple quantile ranks: it can achieve more reliable bucket forming.

Introduce the Abnormal Rank Gap heuristic: it can better check “closeness” on a single quantile rank, and it avoids breaking big buckets into small ones.

Future work:

The general setting: input full rankings have various lengths.

Some theoretical basis: gain better insight into GAP’s effectiveness.


A Note on Correction of Reference 7

Two authors were left out in Reference 7 (Sivakumar, D., and Vee, E.).

The correct version should be: Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing and Aggregating Rankings with Ties. ACM PODS, 2004, pp. 47–58.


Thank you

Any questions?


References

[Ailon, STOC’2005] Ailon, N., Charikar, M., and Newman, A. Aggregating Inconsistent Information: Ranking and Clustering. ACM STOC, 2005, pp. 684–693.

[Fagin, PODS’2004] Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing and Aggregating Rankings with Ties. ACM PODS, 2004, pp. 47–58.

[Fagin, SIGMOD’2003] Fagin, R., Kumar, R., and Sivakumar, D. Efficient Similarity Search and Classification via Rank Aggregation. ACM SIGMOD, 2003, pp. 301–312.

[Gionis, KDD’2006] Gionis, A., Mannila, H., Puolamäki, K., and Ukkonen, A. Algorithms for Discovering Bucket Orders from Data. ACM KDD, 2006, pp. 561–566.

[Puolamäki, PLoS Comput Biol’2006] Puolamäki, K., Fortelius, M., and Mannila, H. Seriation in Paleontological Data Using Markov Chain Monte Carlo Methods. PLoS Comput Biol 2(2): e6, 2006.
