optimal top-k generation of attribute combinations based on ranked lists jiaheng lu, renmin...

Optimal Top-k Generation of Attribute Combinations based on

Ranked Lists

Jiaheng Lu, Renmin University of China

Joint work with Pierre Senellart, Chunbin Lin, Xiaoyong Du, Shan Wang, and Xinxing Chen

Motivation & Problem Statement

Goal: Select a combination of three players including forward, center, and guard positions.

CenterForward GuardF1 F2 C1 C2 G1 G2

Juwan Howard LeBron James Chris Bosh Eddy Curry Dwyane Wade Terrel Harris

(G02, 8.91)(G08, 8.07)(G05, 7.54)(G10, 7.52)(G03, 6.14)(G01, 5.05)(G04, 5.01)(G09, 3.34)……(G06, 3.01)

(G01, 3.81)(G06, 3.59)(G04, 3.21)(G07, 3.03)(G09, 2.07)(G11, 1.70)(G10, 1.62)(G02, 1.59)……(G08, 1.19)

(G02, 6.59)(G03, 6.19)(G04, 5.81)(G05, 4.01)(G01, 3.38)(G09, 2.25)(G06, 1.52)(G08, 1.51)……(G07, 1.00)

(G09, 7.10)(G03, 6.01)(G04, 3.79)(G08, 3.02)(G05, 2.89)(G02, 2.52)(G01, 2.00)(G10, 1.59)……(G06, 1.52)

(a) Source data of three groups

(G01, 9.31)(G07, 9.02)(G03, 8.87)(G04, 5.02)(G11, 4.81)(G08, 4.02)(G06, 4.31)(G05, 3.59)……(G09, 2.06)

(G05, 7.21)(G02, 6.01)(G06, 5.58)(G10, 5.51)(G04, 5.00)(G11, 3.09)(G01, 2.06)(G08, 2.03)……(G09, 1.98)

Methods

select players with the highest score in each group

calculate the average scores of players across all games

… …

Limitation:overlook team spirit

Game IDScore


consider their combined scores in the same game

Select the top-k combinations according to top-m aggregate scores

Top-k,m Problem

Tuple aggregation functionInstance aggregation

function

Motivation & Problem StatementTop-1,2: select the top-1 combination of players according to top-2 aggregate scores for games where they played together.

F2C1G1 is the best combination, since (21.51 + 18.76) is the highest overall score.

CenterForward GuardF1 F2 C1 C2 G1 G2

Juwan Howard LeBron James Chris Bosh Eddy Curry Dwyane Wade Terrel Harris

(G02, 8.91)(G08, 8.07)(G05, 7.54)(G10, 7.52)(G03, 6.14)(G01, 5.05)(G04, 5.01)(G09, 3.34)……(G06, 3.01)

(G01, 3.81)(G06, 3.59)(G04, 3.21)(G07, 3.03)(G09, 2.07)(G11, 1.70)(G10, 1.62)(G02, 1.59)……(G08, 1.19)

(G02, 6.59)(G03, 6.19)(G04, 5.81)(G05, 4.01)(G01, 3.38)(G09, 2.25)(G06, 1.52)(G08, 1.51)……(G07, 1.00)

(G09, 7.10)(G03, 6.01)(G04, 3.79)(G08, 3.02)(G05, 2.89)(G02, 2.52)(G01, 2.00)(G10, 1.59)……(G06, 1.52)

(a) Source data of three groups

(b) Top-2 aggregate scores for each combination

(G01, 9.31)(G07, 9.02)(G03, 8.87)(G04, 5.02)(G11, 4.81)(G08, 4.02)(G06, 4.31)(G05, 3.59)……(G09, 2.06)

(G05, 7.21)(G02, 6.01)(G06, 5.58)(G10, 5.51)(G04, 5.00)(G11, 3.09)(G01, 2.06)(G08, 2.03)……(G09, 1.98)

F1C1G1

(G04, 15.83)(G05, 14.81)

F1C1G2 F1C2G1 F1C2G2 F2C1G1 F2C1G2 F2C2G1 F2C2G2

(G04, 13.81)(G05, 13.69)

(G01, 16.50)(G04, 14.04)

(G01, 15.12)(G07, 12.05)

(G02, 21.51)(G05, 18.76)

(G05, 17.64)(G02, 17.44)

(G02, 17.09)(G04, 14.03)

(G02, 13.02)(G09, 12.51)

… … … … … … … …

Difference between top-k queries and top-k,m queries

Top-k Top-k,m

Return the top-k tuplesReturn the top-k

combinations of attributes

Can be transformed into a SQL

Cannot be transformed into a SQL

ApplicationXML keyword refinement

ExampleExample

Q = {DB;UC Irvine; 2002} Groups: G1 = {"DB"; "database"}, G2={"UCI";"UC Irvine"}G3 = {"2002"}.

Answer:

Q’={DB, UCI, 2002}

Consider a top-1,2 query

Application (Cont.)

• Evidence combination mining in medical databases

• Package recommendation systems

• …


Top-k,m Query Processing

Experimental Results

Conclusion

Outline

Top-k,m Query ProcessingAccess Model: Sorted Accesses

(a, 9.0)(b, 8.7)(c, 8.7)(d, 7.4)

(i, 5.3)

……

(i, 8.8)

(f, 6.9)

(a, 7.5)

(d, 4.7)

(c, 7.9)

……

Top-k,m Query ProcessingAccess Model: Random Accesses

(a, 9.0)(b, 8.7)(c, 8.7)(d, 7.4)

(i, 5.3)

……

(i, 8.8)

(f, 6.9)

(a, 7.5)

(d, 4.7)

(c, 7.9)

……

Top-k,m Query ProcessingBaseline Method: ETA

Compute top-m tuples for each

combinationThreshold Algorithm

(TA)

Calculate aggregate score for each combination

Return the top-k combinations

(a, 10)(b,7.4)

……

(c, 8.3)(d, 7.1)

……

(b, 9)(a, 8)

……

(e, 8)(f, 7)

……

A1 A2 B1 B2

G1 G2

…... …...(b, 9)(a, 8)

……

(e, 8)(f, 7)

……

C1 C2

G3

…...

(A1, B1, C1) 100 (A1, B1, C2) 105

Top-k,m Query ProcessingUpper and Lower bounds Algorithm: ULA

Consider top-m seen match instances

Lower Bound

Consider threshold value and top-m match instances

Upper Bound

Compute the upper and lower bounds for

each combination

Termination condition:

k combinations meet the hit-condition

A1 A2 B1 B2

Group 1 Group 2

(G1, 9.3)

(G2, 8.3)

(G5,7.8)

(G11,7.3)

(G2, 7.9)

(G1,7.0)

(G4,8.0)

(G8,7.3)

(G8, 3.0)

(G4, 2.6)

(G4, 1.8)

(G2, 1.5)

(G11, 4.2)

(G5, 3.3)

(G2, 4.4)

(G1, 2.3)

Upper and Lower bounds Algorithm: ULA

A1B1

A1 A2 B1 B2

Group 1 Group 2

(G1, 9.3)

(G2, 8.3)

(G5,7.8)

(G11,7.3)

(G2, 7.9)

(G1,7.0)

(G4,8.0)

(G8,7.3)

(G8, 3.0)

(G4, 2.6)

(G4, 1.8)

(G2, 1.5)

(G11, 4.2)

(G5, 3.3)

(G2, 4.4)

(G1, 2.3)


A1B1U: 34.4

L: 32.5

A1B2U: 34.6

L: 22.2

A2B1U: 31.4

L: 20.5

A2B2U: 31.6

L: 9.8

L: 32.5A1B1 （ G1, 9.3+7.0 ） , （ G2, 7.9+8.3）

U: 31.4A2B1 Threshold value=7.8+7.9=15.7, 15.7*2=31.4

A1 A2 B1 B2

Group 1 Group 2

(G1, 9.3)

(G2, 8.3)

(G5,7.8)

(G11,7.3)

(G2, 7.9)

(G1,7.0)

(G4,8.0)

(G8,7.3)

(G8, 3.0)

(G4, 2.6)

(G4, 1.8)

(G2, 1.5)

(G11, 4.2)

(G5, 3.3)

(G2, 4.4)

(G1, 2.3)


A1B1U: 32.5

L: 32.5

A1B2U: 31.2

L: 22.2

A1B1

Can we run fast?

Optimization heuristics (1)Pruning combinations without computing the bounds

(A3,B2) is dominated by (A2,B1)

6.3<7.1 and 8.0<8.2

Optimization heuristics (2)Reducing the number of accesses

Avoiding both sorted and random accesses for specific lists

(a, 10)(b,7.4)

……

(c, 8.3)(d, 7.1)

……

(a, 6.3)(d, 4)

……

(b, 9)(a, 8)

……

(e, 8)(f, 7)

……

A1 A2 A3 B1 B2

G1 G2

(A1,B1)and(A1,B2) cannot be part of answers, all sorted accesses and random accesses on list A1 are unnecessary.


Reducing random accesses across two lists

(a, 10)(b,7.4)

……

(c, 8.3)(d, 7.1)

……

(b, 9)(a, 8)

……

(e, 8)(f, 7)

……

A1 A2 B1 B2

G1 G2

(e, 9)(d, 7)

……

(k, 8)(f, 7)

……

C1 C2

G3

(A1,B1,C1)and(A1,B1,C2) cannot be part of answers, random accesses between A1 and B1 are unnecessary.


Eliminating random accesses for specific tuples

Random access from Le to Lt for tuple x is useless

Prune dominated

combinations

Compute upper and lower bounds for unterminated combinations

Terminate combinations by reducing number of accesses

Until k combinations meet hit-condition


ULA+

Interesting theoretical results Optimality properties

Instance OptimalityInstance Optimality

If wild guesses are not allowed, and the size of each group is treated as a constant, then ULA and ULA+ are instance-optimal.

The upper bound of the optimality ratio is tightThe upper bound of the optimality ratio is tight

for every instance there exist two constants a and bsuch that cost(A) <= a*cost(A’) + b

Interesting theoretical results (Cont.) Optimality properties

No Instance Optimal AlgorithmsNo Instance Optimal Algorithms

If wild guesses are allowed, Then there is no deterministic algorithm that is instance-optimal.




Conclusion

Outline


Experimental Setup

Language: Java; OS: Windows XP; CPU: 2.0GHz; Disk:320GB

Data sets

Experimental ResultsExperimental results on NBA and YQL datasets

ULA+ outperforms ETA by 1-2 orders of magnitude both in running time and access number.

Experimental ResultsPerformance of optimization to reduce combinations

More than 60% combinations are pruned without computing their bounds

Experimental ResultsPerformance of different optimizations

Combination of all optimizations has the most powerful pruning capability.

Experimental ResultsExperimental results on XML DBLP dataset

XULA and XULA+ perform better than XETA and scale well in both running time and number of accesses.

U. Güntzer etc, VLDB2000 S. Nepal etc, ICDE1999

Fagin etc, PODS 2001

Top-k with both random and

sorted accesses

R. Fagin etc, JCSS2003 N. Mamoulis etc, TDS2007

Fagin etc, PODS 2001

Top-k with only sorted accesses

Related Works

I. F. Ilya etc, VLDB2002Top-k with no need for exact

aggregate score

C. Li etc, SIGMOD2006 M. L. Yiu etc, DKE2008

Ad-hoc top-k queries

Related Works

N. Bruno etc, ICDE2002 K. C. C. Chang etc, SIGMOD2002

Top-k with sorted access on

restricted lists

Related Works

T. Deng, W, Fan and F. Geerts,On the Complexity of Package

Recommendation Problems PODS 2012

Top-k Package recommendation




Conclusion

Outline

Conclusion

• Propose a new problem called top-k,m query evaluation

• Developed a family of efficient algorithms, including ULA and ULA+

• Study the optimality properties of our algorithms

• Apply top-k,m query to the context of XML keyword query refinement

Optimal Top-k Generation of Attribute Combinations based on

Ranked Lists

optimal top-k generation of attribute combinations based on ranked lists jiaheng lu, renmin...

Documents