CS246

Ranked Queries

Junghoo "John" Cho (UCLA Computer Science)

Traditional Database Query

(Dept = “CS”) & (GPA > 3.5)

Boolean semantics
  Clear boundary between “answers” and “non-answers”

Goal: Find all “matching” tuples
  Optionally ordered by a certain field

[Figure: the answer set A inside T, the set of all tuples, with a clear boundary]


Ranked Queries

Find “cheap” houses “close” to UCLA
  Cheap(x) & NearUCLA(x)

Non-Boolean semantics
  No clear boundary between “answers” and “non-answers”
  Answers inherently ranked

Goal: Find top ranked tuples

[Figure: the answer set A inside T, the set of all tuples, with no clear boundary]


Issues?

How to rank?
  Distance 3 miles: proximity?
  Similarity: looks like “Tom Cruise”?

How to combine rankings?
  Price = 0.8, Distance = 0.2. Overall grade?

Weighting?
  Price is twice as “important” as distance?

Query processing?
  Traditional query processing is based on Boolean semantics


Fagin’s paper

Previously all of the 4 issues were a “black art”
  No disciplined way to address the problems

Fagin’s paper studied the last 3 issues in a more “disciplined” way
  Combination of ranks
  Weighting
  Query processing

Find general “properties” and derive a formula satisfying the properties


Topics

Combining multiple grades
Weighting
Efficient query processing


Rank Combination

Cheap(x) & NearUCLA(x)
  Cheap(x) = 0.3
  NearUCLA(x) = 0.8
  Overall ranking?

How would you approach the problem?


General Query

(Cheap(x) & (NearUCLA(x) | NearBeach(x))) & RedRoof(x)

How to compute the overall grade?

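[Figure: expression tree for the query, with leaf grades Cheap = 0.3, NearUCLA = 0.2, NearBeach = 0.8, RedRoof = 0.6; NearUCLA and NearBeach are joined by |, and the result is joined with Cheap and then RedRoof by &]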


Main Idea

Atomic scoring function A(x): given by application

Cheap(x) = 0.3, NearUCLA(x) = 0.2 …

Query: recursive application of AND and OR
  (Cheap & (NearUCLA | NearBeach)) & RedRoof

Combination of two grades for “AND” and “OR”
  Binary function $t: [0, 1]^2 \to [0, 1]$
  Example: min(a, b) for “AND”?

Cheap & NearUCLA (x) = min(0.3, 0.2) = 0.2

Properties of AND/OR scoring function?
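
The recursive evaluation can be sketched as follows. This is only an illustration: it assumes min() for “AND” and max() for “OR” (the candidate raised on this slide), and the tuple encoding of the query tree and the name `grade` are hypothetical, not from the slides.

```python
from typing import Dict, Tuple, Union

# A query is either an atomic predicate name or a tuple (op, left, right).
Query = Union[str, Tuple[str, "Query", "Query"]]


def grade(query: Query, atomic: Dict[str, float]) -> float:
    """Recursively compute the overall grade of one object."""
    if isinstance(query, str):
        return atomic[query]      # atomic grade, supplied by the application
    op, left, right = query
    a, b = grade(left, atomic), grade(right, atomic)
    if op == "&":
        return min(a, b)          # assumed scoring function for AND
    if op == "|":
        return max(a, b)          # assumed scoring function for OR
    raise ValueError(f"unknown operator: {op}")


# (Cheap & (NearUCLA | NearBeach)) & RedRoof with the grades used above
house = {"Cheap": 0.3, "NearUCLA": 0.2, "NearBeach": 0.8, "RedRoof": 0.6}
q = ("&", ("&", "Cheap", ("|", "NearUCLA", "NearBeach")), "RedRoof")
overall = grade(q, house)  # min(min(0.3, max(0.2, 0.8)), 0.6)
```

Under these assumptions the house gets the overall grade 0.3, the grade of its weakest conjunct.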


Properties of Scoring Function

Logical equivalence
  The same overall score for logically equivalent queries
  A&(B|C)(x) = (A&B)|(A&C)(x)

Monotonicity
  If A(x1) < A(x2) and B(x1) < B(x2), then A&B(x1) < A&B(x2)
  In general: $t(x_1, x_2) < t(x'_1, x'_2)$ if $x_i < x'_i$ for all $i$
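
The logical equivalence property can be probed numerically. The sketch below checks the distributive law A&(B|C) = (A&B)|(A&C) on a grid of grades for two candidate pairs of scoring functions. The product / probabilistic-sum pair is a hypothetical alternative chosen for contrast, not something from the slides, and a grid check is evidence rather than a proof.

```python
import itertools


def equivalent(and_fn, or_fn, grid):
    """Check A&(B|C) == (A&B)|(A&C) for every grade triple on the grid."""
    return all(
        abs(and_fn(a, or_fn(b, c)) - or_fn(and_fn(a, b), and_fn(a, c))) < 1e-12
        for a, b, c in itertools.product(grid, repeat=3)
    )


grid = [i / 10 for i in range(11)]

# Candidate 1: min for AND, max for OR
min_max_ok = equivalent(min, max, grid)

# Candidate 2 (hypothetical): product for AND, probabilistic sum for OR
product_ok = equivalent(lambda a, b: a * b, lambda a, b: a + b - a * b, grid)
```

With A = B = C = 0.5 the second candidate gives 0.375 on one side and 0.4375 on the other, so it fails the check, while min/max passes on the whole grid.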


Uniqueness Theorem

min() and max() are the only scoring functions with the two properties
  min() for “AND” and max() for “OR”

Quite surprising and interesting result
  More discussion later

Is logical equivalence really true?


Question on Logical Equivalence?

Query: Homepage of “John Grisham”
  PageRank & John & Grisham
  Logically equivalent, but are they the same?
  Does logical equivalence hold for non-Boolean queries?

[Figure: two expression trees over PR, John, and Grisham, each combining the three leaves with two & operators]


Summary of Scoring Function

Question: how to combine rankings
  Scoring function: combine grades

Results from fuzzy logic
  Logical equivalence
  Monotonicity
  Uniqueness theorem: min() for “AND” and max() for “OR”

Logical equivalence may not be valid for graded Boolean expressions


Topics



Weighting of Grades

Cheap(x) & NearUCLA(x)
  What if proximity is “more important” than price?

Assign weights to each atomic query
  Cheap(x) = 0.2, weight = 1
  NearUCLA(x) = 0.8, weight = 10
  Proximity is 10 times more important than price
  Overall grade?


Formalization

m atomic queries

$\Theta = (\theta_1, \ldots, \theta_m)$: weight of each atomic query

$X = (x_1, \ldots, x_m)$: grades from each atomic query

$f(x_1, \ldots, x_m)$: unweighted scoring function

$f_\Theta(x_1, \ldots, x_m)$: new weighted scoring function

What should $f_\Theta(x_1, \ldots, x_m)$ be, given $\Theta$? Properties of $f_\Theta(x_1, \ldots, x_m)$?


Properties

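P1: When all weights are equal, the weighted function reduces to the unweighted one
  $f_{(1/m, \ldots, 1/m)}(x_1, \ldots, x_m) = f(x_1, \ldots, x_m)$

P2: If an argument has zero weight, we can safely drop the argument
  $f_{(\theta_1, \ldots, \theta_{m-1}, 0)}(x_1, \ldots, x_m) = f_{(\theta_1, \ldots, \theta_{m-1})}(x_1, \ldots, x_{m-1})$

P3: $f_\Theta(X)$ should be locally linear in the weights $\Theta$
  $f_{\alpha\Theta + (1-\alpha)\Theta'}(x_1, \ldots, x_m) = \alpha f_\Theta(x_1, \ldots, x_m) + (1-\alpha) f_{\Theta'}(x_1, \ldots, x_m)$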


Local Linearity Example

$\Theta_1 = (1/2, 1/2)$, $f_{\Theta_1}(X) = 0.2$
$\Theta_2 = (1/4, 3/4)$, $f_{\Theta_2}(X) = 0.4$

If $\Theta_3 = (3/8, 5/8) = \frac{1}{2}\Theta_1 + \frac{1}{2}\Theta_2$, then
  $f_{\Theta_3}(X) = \frac{1}{2} f_{\Theta_1}(X) + \frac{1}{2} f_{\Theta_2}(X) = 0.3$

Q: m atomic queries. How many independent weight assignments?
A: m. Only m degrees of freedom

Very strong assumption
  Not too unreasonable, but no rationale


Theorem

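$1 \cdot (\theta_1 - \theta_2) f(x_1) + 2 \cdot (\theta_2 - \theta_3) f(x_1, x_2) + 3 \cdot (\theta_3 - \theta_4) f(x_1, x_2, x_3) + \ldots + m \cdot \theta_m \cdot f(x_1, \ldots, x_m)$

is the only function that satisfies these properties, where $\theta_i$ is the weight of the i-th atomic query and the queries are ordered so that $\theta_1 \ge \theta_2 \ge \ldots \ge \theta_m$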


Examples

Weights (1/3, 1/3, 1/3)
  $1 \cdot (1/3 - 1/3) f(x_1) + 2 \cdot (1/3 - 1/3) f(x_1, x_2) + 3 \cdot (1/3) f(x_1, x_2, x_3)$
  $= f(x_1, x_2, x_3)$

Weights (1/2, 1/4, 1/4)
  $1 \cdot (1/2 - 1/4) f(x_1) + 2 \cdot (1/4 - 1/4) f(x_1, x_2) + 3 \cdot (1/4) f(x_1, x_2, x_3)$
  $= \frac{1}{4} f(x_1) + \frac{3}{4} f(x_1, x_2, x_3)$

Weights (1/2, 1/3, 1/6)
  $1 \cdot (1/2 - 1/3) f(x_1) + 2 \cdot (1/3 - 1/6) f(x_1, x_2) + 3 \cdot (1/6) f(x_1, x_2, x_3)$
  $= \frac{1}{6} f(x_1) + \frac{2}{6} f(x_1, x_2) + \frac{3}{6} f(x_1, x_2, x_3)$
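
The formula can be turned into a few lines of code. The following is a minimal sketch, assuming the weights are non-negative and sum to 1, and that the unweighted function accepts a list of any length (as min() does). The name `weighted_score` and the choice to reorder the arguments by decreasing weight inside the function are illustrative assumptions.

```python
from typing import Callable, Sequence


def weighted_score(f: Callable[[Sequence[float]], float],
                   weights: Sequence[float],
                   grades: Sequence[float]) -> float:
    """Weighted version of an unweighted scoring function f.

    Assumes non-negative weights that sum to 1. Arguments are first
    reordered so that the weights are non-increasing.
    """
    pairs = sorted(zip(weights, grades), key=lambda p: p[0], reverse=True)
    w = [p[0] for p in pairs] + [0.0]   # sentinel: weight after the last is 0
    x = [p[1] for p in pairs]
    total = 0.0
    for i in range(1, len(x) + 1):
        # i * (weight_i - weight_{i+1}) * f(x_1, ..., x_i)
        total += i * (w[i - 1] - w[i]) * f(x[:i])
    return total


# Cheap(x) = 0.2 with weight 1 and NearUCLA(x) = 0.8 with weight 10,
# normalized so that the weights sum to 1.
house_grade = weighted_score(min, [1 / 11, 10 / 11], [0.2, 0.8])
```

For this house the sketch evaluates 9/11 · 0.8 + 2/11 · min(0.8, 0.2) = 7.6/11, roughly 0.69, so the heavily weighted proximity grade dominates.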


Summary of Weighting

Question: different “importance” of grades
  $\Theta = (\theta_1, \ldots, \theta_m)$: weight assignment

Uniqueness theorem
  Local linearity and two other reasonable assumptions
  $1 \cdot (\theta_1 - \theta_2) f(x_1) + 2 \cdot (\theta_2 - \theta_3) f(x_1, x_2) + \ldots + m \cdot \theta_m \cdot f(x_1, \ldots, x_m)$

Linearity assumption questionable


Application?

Web page ranking
  PageRank & (Keyword1 & Keyword2 & …)

Should we use min()?
  min(keyword1, keyword2, keyword3, …)
  Would it be better than the cosine measure?

If PageRank is 10 times more important, should we use Fagin’s formula?
  9/11 PR + 2/11 min(PR, min(keywords))
  Would it be better than other ranking functions?

Is Fagin’s formula practical?


Topics



Question

How can we process ranked queries efficiently?
  Top k answers for “Cheap(x) & NearUCLA(x)”
  Assume we have good scoring functions

How do we process a traditional Boolean query?
  GPA > 3.5 & Dept = “CS”

What’s the difference? What is difficult compared to a Boolean query?


Naïve Solution

Cheap(x) & NearUCLA(x)

1. Read prices of all houses

2. Compute distances of all houses

3. Compute combined grades of all houses

4. Return the k-highest grade objects

Clearly very expensive when the database is large
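
For reference, the naïve plan fits in a few lines. This sketch assumes the two atomic grades are already available as in-memory dictionaries keyed by object; the names `naive_top_k`, `cheap`, and `near` are hypothetical.

```python
import heapq


def naive_top_k(cheap, near, combine, k):
    """Grade every object, then keep the k best.

    The work is proportional to the size of the whole database,
    no matter how small k is.
    """
    graded = ((combine(cheap[obj], near[obj]), obj) for obj in cheap)
    return heapq.nlargest(k, graded)


cheap = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
near = {"a": 0.85, "b": 0.5, "c": 0.2, "d": 0.9}
best = naive_top_k(cheap, near, min, 2)
```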


Main Idea

We don’t have to check all objects/tuples
  Most tuples have low grades and will not be returned

Basic algorithm
  Check top objects from each atomic query and find the best objects

Question: How many objects should we see from each “atomic query”?


Architecture

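[Figure: three sorted grade streams feed a scoring function, which outputs the top objects]

Stream 1: a: 0.9, b: 0.8, c: 0.7, …
Stream 2: d: 0.9, a: 0.85, b: 0.78, …
Stream 3: b: 0.9, d: 0.9, a: 0.75, …

Scoring function: f (x1, x2, x3), any monotonic function
Output: b: 0.78, a: 0.75

Access methods: sorted access, random access
How many to check? How to minimize it?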


Three Papers

Fuzzy queries
Optimal aggregation
Minimal probing


Fagin’s Model

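[Figure: three sorted grade streams feed f (x1, x2, x3); every stream is read by sorted access]

Stream 1 (sorted access): a: 0.9, b: 0.8, c: 0.7, …
Stream 2 (sorted access): d: 0.9, a: 0.85, b: 0.78, …
Stream 3 (sorted access): b: 0.9, d: 0.9, a: 0.75, …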


Fagin’s Model

Sorted access on all streams

Cost model: number of objects accessed by sorted/random accesses
  $c_s \cdot s + c_r \cdot r$ (s sorted accesses, r random accesses)

Ignore the cost for “sorting”
  Reasonable when objects have been sorted already
    Sorted index
  Inappropriate when objects have not been sorted
    We have to compute grades for all objects
    Sorting can be costly


Main Question

How many objects to access? When can we stop?

A: When we know that we have seen at least k objects whose scores are at least as high as that of any unseen object


Fagin’s First Algorithm

Read objects from each stream in parallel

Stop when k objects have been seen in common from all streams

Top answers should be in the union of the objects that we have seen
  Why?

[Figure: the streams are read in parallel into f (x1, x2, x3) until k objects are common to all of them]

Stream 1: a: 0.9, b: 0.8, c: 0.7, …
Stream 2: d: 0.9, a: 0.85, b: 0.78, …
Stream 3: b: 0.9, d: 0.9, a: 0.75, …


Stopping Condition

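Reason
  The grades of the k objects in the intersection are at least as high as those of any unseen object

Proof
  x: object in the intersection, y: unseen object
  $y_1 \le x_1$, because stream 1 is sorted and x was read from it before y. Similarly $y_i \le x_i$ for all $i$
  $f(y_1, \ldots, y_m) \le f(x_1, \ldots, x_m)$ due to monotonicity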


Fagin’s First Algorithm

1. Get objects from each stream in parallel until we have seen k objects in common from all streams

2. For every object that we have seen so far, if its complete grade is not known, obtain the unknown grades by random access

3. Find the k objects with the highest grades
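
The three steps can be sketched over in-memory tables. This is an illustration under simplifying assumptions: each stream is a dictionary holding a grade for every object, sorted access is simulated by sorting the dictionary once, and random access is a dictionary lookup. The name `fagin_fa` is hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def fagin_fa(streams: List[Dict[str, float]],
             f: Callable[[Sequence[float]], float],
             k: int) -> List[Tuple[str, float]]:
    """Sketch of Fagin's first algorithm (FA) over in-memory grade tables."""
    # Simulated sorted access: each table ordered by descending grade.
    sorted_lists = [sorted(s.items(), key=lambda kv: -kv[1]) for s in streams]
    seen_in = {}  # object -> set of stream indexes where it was seen
    # Step 1: parallel sorted access until k objects were seen in all streams.
    for depth in range(len(sorted_lists[0])):
        for i, lst in enumerate(sorted_lists):
            obj, _ = lst[depth]
            seen_in.setdefault(obj, set()).add(i)
        common = sum(1 for s in seen_in.values() if len(s) == len(streams))
        if common >= k:
            break
    # Step 2: random access (a lookup here) supplies the grades of seen objects.
    graded = [(obj, f([s[obj] for s in streams])) for obj in seen_in]
    # Step 3: return the k objects with the highest overall grades.
    graded.sort(key=lambda og: -og[1])
    return graded[:k]


x1 = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
x2 = {"d": 0.9, "a": 0.85, "b": 0.5, "c": 0.2}
top2 = fagin_fa([x1, x2], min, 2)
```

For simplicity the sketch looks up every grade of each seen object in step 2, whereas the algorithm only needs random access for the grades that sorted access has not already delivered.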


Example (k = 2)

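Scoring function: min(x1, x2)

Stream 1 (sorted by x1): a: 0.9, b: 0.8, c: 0.7, …
Stream 2 (sorted by x2): d: 0.9, a: 0.85, b: 0.5, …
Grades obtained by random access: d: x1 = 0.6, c: x2 = 0.2

Object  x1    x2    min
a       0.9   0.85  0.85
d       0.6   0.9   0.6
b       0.8   0.5   0.5
c       0.7   0.2   0.2

Result: a: 0.85, d: 0.6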


Performance

We only look at a subset of objects
  Ignoring the high cost of random access, clearly better than the naïve solution

Total number of accesses
  $O(N^{(m-1)/m} k^{1/m})$, assuming independent and random object order for each atomic query
  E.g., $O(N^{1/2} k^{1/2})$ if m = 2


Summary of Fagin’s Algorithm

Sorted access on all streams

Stopping condition
  k common objects from all streams


Problem of Fagin’s Algorithm

Performance depends heavily on object orders in the streams

k = 1, min(x1, x2)
  Stream 1: b: 1, a: 1, c: 1, d: 0, e: 0
  Stream 2: e: 1, d: 1, b: 1, c: 0, a: 0

We need to read all objects
  Sorted access until the 3rd object of each stream, then random access for all the remaining grades

Can we avoid this pathological scenario?


New Idea

Let us read all grades of an object once we see it from a sorted access
  Do not need to wait until the streams give k common objects
  Less dependent on the object order

When can we stop?
  Until we have seen k common objects from sorted accesses?


When Can We Stop?

If we are sure that we have seen at least k objects whose grades are higher than those of unseen objects

How do we know the grades of unseen objects?

Can we predict the maximum grade of unseen objects?


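Maximum Grade of Unseen Objects

Assuming min(x1, x2), what will be the maximum grade of unseen objects?

Stream 1: a: 1, b: 0.9, c: 0.8, d: 0.7, e: 0.6
Stream 2: e: 1, d: 0.8, b: 0.7, c: 0.7, a: 0.2

If the last grades read by sorted access are 0.8 on stream 1 and 0.7 on stream 2, every unseen object has $x_1 \le 0.8$ and $x_2 \le 0.7$, so its grade is at most min(0.8, 0.7) = 0.7

Generalization?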


Generalization

$\bar{x}_i$: the minimum grade read so far from stream i by sorted access

$f(\bar{x}_1, \ldots, \bar{x}_m)$ is the maximum grade of unseen objects
  $x_i \le \bar{x}_i$ for all unseen objects
  $f(x_1, \ldots, x_m)$: monotonic


Basic Idea of TA

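We can stop when the top k seen object grades are at least as high as the maximum grade of unseen objects
  Maximum grade of unseen objects: $f(\bar{x}_1, \ldots, \bar{x}_m)$, where $\bar{x}_i$ is the lowest grade read so far from stream i by sorted access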


Threshold Algorithm

1. Read one object from each stream by sorted access

2. For each object O that we just read
  Get all grades for O by random access
  If f (O) is in the top k, store it in a buffer

3. If the lowest grade of the top k objects is at least the threshold $f(\bar{x}_1, \ldots, \bar{x}_m)$, where $\bar{x}_i$ is the last grade read from stream i, stop; otherwise repeat from step 1
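
A minimal sketch of these steps, under the same simplifying assumptions as before: each stream is an in-memory dictionary with a grade for every object, sorted access is simulated by sorting, and random access is a lookup. The name `threshold_algorithm` is hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple


def threshold_algorithm(streams: List[Dict[str, float]],
                        f: Callable[[Sequence[float]], float],
                        k: int) -> List[Tuple[str, float]]:
    """Sketch of the threshold algorithm (TA) with a buffer of at most k objects."""
    sorted_lists = [sorted(s.items(), key=lambda kv: -kv[1]) for s in streams]
    top = []  # min-heap of (grade, object), never more than k entries
    for depth in range(len(sorted_lists[0])):
        last_seen = []
        for lst in sorted_lists:
            obj, g = lst[depth]                    # step 1: sorted access
            last_seen.append(g)
            if any(o == obj for _, o in top):
                continue                           # already in the buffer
            grade = f([s[obj] for s in streams])   # step 2: random access
            if len(top) < k:
                heapq.heappush(top, (grade, obj))
            elif grade > top[0][0]:
                heapq.heapreplace(top, (grade, obj))
        threshold = f(last_seen)  # best grade any unseen object can still have
        if len(top) == k and top[0][0] >= threshold:   # step 3
            break
    return sorted(((o, g) for g, o in top), key=lambda og: -og[1])


x1 = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
x2 = {"d": 0.9, "a": 0.85, "b": 0.5, "c": 0.2}
top2 = threshold_algorithm([x1, x2], min, 2)
```

Only the k-entry heap is kept between rounds. An object that reappears after being rejected is simply graded again and rejected again, which is the price of the bounded buffer.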


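Example (k = 2)

Scoring function: min(x1, x2)

Stream 1 (sorted by x1): a: 0.9, b: 0.8, c: 0.7, …
Stream 2 (sorted by x2): d: 0.9, a: 0.85, b: 0.5, …
Grades obtained by random access: d: x1 = 0.6, c: x2 = 0.2

Object  x1    x2    min
a       0.9   0.85  0.85
d       0.6   0.9   0.6
b       0.8   0.5   0.5
c       0.7   0.2   0.2

Threshold after each round of sorted access
  Initially: f (1, 1) = 1
  Round 1: f (0.9, 0.9) = 0.9
  Round 2: f (0.8, 0.85) = 0.8
  Round 3: f (0.7, 0.5) = 0.5

Result: a: 0.85, d: 0.6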


Comparison of FA and TA?

TA sees no more objects than FA
  TA always stops at least as early as FA
  When we have seen k objects in common, their grades are at least as high as the threshold

TA may perform more random accesses than FA
  In TA, (m-1) random accesses for each object
  In FA, random accesses are done at the end, only for missing grades

TA requires only bounded buffer space (k)
  At the expense of more random seeks


Comparison of FA and TA

TA can be better in general, but it may perform more random seeks

What if random seek is very expensive or impossible? Algorithm with no random seek possible?


Algorithm NRA

An algorithm with no random seek
  Isn’t random seek essential?

How can we know the grade of an object when some of its grades are missing?


Basic Idea

We may still compute the lower bound of an object’s grade, even if we miss some of its grades
  E.g., $\max(0.6, x) \ge 0.6$

We may also compute the upper bound of an object’s grade, even if we miss some of its grades
  E.g., $\max(0.6, x) \le 0.8$ if $x \le 0.8$

If the lower bound of O1 is higher than the upper bound of other objects, we can return O1


Generalization

$(\bar{x}_1, \ldots, \bar{x}_m)$: the minimum grades read so far from each stream by sorted access

Lower bound of object: 0 for missing grades
  When $x_3, x_4$ are missing, $f(x_1, x_2, 0, 0)$
  From monotonicity

Upper bound of object: $\bar{x}_i$ for missing grades
  When $x_3, x_4$ are missing, $f(x_1, x_2, \bar{x}_3, \bar{x}_4)$
  $x_3 \le \bar{x}_3$, $x_4 \le \bar{x}_4$, thus $f(x_1, x_2, x_3, x_4) \le f(x_1, x_2, \bar{x}_3, \bar{x}_4)$


NRA Algorithm

1. Read one object from each stream by sorted access. Let $(\bar{x}_1, \ldots, \bar{x}_m)$ be the lowest grades read so far from the streams

2. For each object O seen so far, update its upper/lower bounds
  Upper bound = use $\bar{x}_i$ for missing grades
  Lower bound = use 0 for missing grades

3. If the lower bounds of the top k objects are at least as large as the upper bounds of every other object (including unseen ones), stop
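
The bound bookkeeping can be sketched as follows. The sketch assumes each stream is given as a list of (object, grade) pairs in descending grade order, that all lists have the same length, and that ties in the stopping test are accepted within a small floating-point tolerance. The name `nra` is hypothetical.

```python
def nra(sorted_lists, f, k):
    """Sketch of NRA: sorted access only, objects tracked by grade bounds."""
    m = len(sorted_lists)
    known = {}    # object -> {stream index: grade learned by sorted access}
    ranked = []
    for depth in range(len(sorted_lists[0])):
        lowest = []   # last (lowest) grade read from each stream
        for i, lst in enumerate(sorted_lists):
            obj, g = lst[depth]
            known.setdefault(obj, {})[i] = g
            lowest.append(g)
        # Lower bound: missing grades count as 0.
        lower = {o: f([gs.get(i, 0.0) for i in range(m)])
                 for o, gs in known.items()}
        # Upper bound: missing grades count as the lowest grade read so far.
        upper = {o: f([gs.get(i, lowest[i]) for i in range(m)])
                 for o, gs in known.items()}
        ranked = sorted(known, key=lambda o: -lower[o])
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            kth_lower = lower[top[-1]]
            # f(lowest) bounds the grade of every object not seen at all.
            others = [upper[o] for o in rest] + [f(lowest)]
            if all(kth_lower >= u - 1e-9 for u in others):
                return top
    return ranked[:k]


def avg(xs):
    return sum(xs) / len(xs)


stream1 = [("a", 0.9), ("b", 0.5), ("c", 0.3)]
stream2 = [("d", 0.7), ("a", 0.6), ("e", 0.2)]
answer = nra([stream1, stream2], avg, 2)
```

Note that the returned objects come without exact grades: d is returned although only its lower bound 0.35 is known.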


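Example (k = 2)

Scoring function: AVG(x1, x2)

Stream 1 (sorted by x1): a: 0.9, b: 0.5, c: 0.3, …
Stream 2 (sorted by x2): d: 0.7, a: 0.6, e: 0.2, …

Lower bound / upper bound of each object after each round of sorted access (? = grade not yet seen):

Object  x1   x2   Round 1       Round 2       Round 3
a       0.9  0.6  0.45 / 0.8    0.75 / 0.75   0.75 / 0.75
d       ?    0.7  0.35 / 0.8    0.35 / 0.6    0.35 / 0.5
b       0.5  ?    -             0.25 / 0.55   0.25 / 0.35
c       0.3  ?    -             -             0.15 / 0.25
e       ?    0.2  -             -             0.1 / 0.25
unseen            - / 0.8       - / 0.55      - / 0.25

Round 1: lower bounds AVG(0.9, 0) = 0.45 and AVG(0, 0.7) = 0.35; upper bounds AVG(0.9, 0.7) = 0.8
Round 2: a is complete, AVG(0.9, 0.6) = 0.75; d upper bound AVG(0.5, 0.7) = 0.6; b bounds AVG(0.5, 0) = 0.25 and AVG(0.5, 0.6) = 0.55
Round 3: d upper bound AVG(0.3, 0.7) = 0.5; b upper bound AVG(0.5, 0.2) = 0.35; c bounds AVG(0.3, 0) = 0.15 and AVG(0.3, 0.2) = 0.25; e bounds AVG(0, 0.2) = 0.1 and AVG(0.3, 0.2) = 0.25

Result: a, d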


Properties of NRA

No random access

We may return an object even if we don’t know its grade
  We may only know its lower bound

We need to constantly update the upper bounds of objects
  As the threshold value decreases


Chang’s View

Computing grades can be expensive
Sorting is expensive
Minimize sorted access


Chang’s Model

Sorted access on one stream and random access on the remaining streams
  At least one sorted access necessary to “discover” objects

Cost model: number of random accesses
  Reasonable when the objects are not sorted for some streams


Chang’s Model

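[Figure: three grade streams feed f (x1, x2, x3); one stream is read by sorted access and the others by random access]

Stream 1: a: 0.9, b: 0.8, c: 0.7, …
Stream 2: d: 0.9, a: 0.85, b: 0.78, …
Stream 3: b: 0.9, d: 0.9, a: 0.75, …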


Chang’s Solution

Main Idea?
  Probe only necessary attributes
  A probe is necessary iff we cannot find the right answer without it

Which probe is necessary?


Necessary Probes

Assume attribute probe order is fixed
Assume min() is the scoring function
Assume the threshold (grade of the kth highest object) is 0.7

a : (0.9, 0.3, 0.2)
  Is the second probe necessary?
b : (0.5, 0.7, 0.3)
  Is the second probe necessary?

Is the necessity dependent on the algorithm?
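
One way to reason about the two objects is to track the upper bound of the grade as attributes become known. The sketch below assumes that the first grade of each object comes from sorted access, that every later grade costs one probe, and that min() is the scoring function. The name `attributes_to_probe` is hypothetical.

```python
def attributes_to_probe(grades, threshold):
    """Return the 1-based positions of the attributes that must be probed.

    Under min() the upper bound of the overall grade is the minimum of the
    grades known so far. A further probe is needed only while that upper
    bound has not dropped below the threshold.
    """
    needed = []
    upper = grades[0]   # assumed to be known from sorted access
    for position, g in enumerate(grades[1:], start=2):
        if upper < threshold:
            break       # the object can no longer reach the top k
        needed.append(position)
        upper = min(upper, g)
    return needed


probes_a = attributes_to_probe((0.9, 0.3, 0.2), 0.7)
probes_b = attributes_to_probe((0.5, 0.7, 0.3), 0.7)
```

Under these assumptions object a needs its second attribute (0.9 is still above the threshold) but not its third, and object b needs no probe at all because 0.5 is already below 0.7.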


Observation

Probe necessity is independent of algorithm
  Purely dependent on the dataset
  Assuming probe order is fixed

How do we find the necessary probes?
  When the upper bound of the grade of object O goes below the threshold, no more probes are necessary for that object

How do we find the threshold value?
  We know the upper bound of the threshold
  Threshold upper bound = kth highest upper bound of grades (gk)

[Figure: upper bounds of object grades in decreasing order, g1, …, gk, …, with the kth value gk marked]


Algorithm MPro

As long as we probe objects with grades above the threshold upper bound, we are safe

Q: a priority queue of objects ordered by the upper bounds of their grades

Repeat
  Pick the top object O from Q
  Probe the next attribute of O
  Stop if we have the complete grade for the top k objects in Q
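
The loop above can be sketched with a heap. This is an illustration under stated assumptions: the first grade of every object is already known from the sorted stream, unknown grades are bounded by 1.0, and `probe(obj, i)` is a hypothetical callback that performs one random access for attribute i of an object.

```python
import heapq


def mpro(first_grades, probe, num_attrs, f, k):
    """Sketch of MPro: always probe the object with the highest upper bound.

    first_grades: object -> grade from the sorted-access stream.
    Returns the top k (object, grade) pairs and the number of probes used.
    """
    heap = []    # max-heap via negated upper bounds
    known = {}
    for obj, g in first_grades.items():
        known[obj] = [g]
        heapq.heappush(heap, (-f([g] + [1.0] * (num_attrs - 1)), obj))
    results, probes = [], 0
    while heap and len(results) < k:
        neg_upper, obj = heapq.heappop(heap)
        grades = known[obj]
        if len(grades) == num_attrs:
            # A complete grade at the top of the queue is a final answer.
            results.append((obj, -neg_upper))
            continue
        grades.append(probe(obj, len(grades)))   # probe the next attribute
        probes += 1
        upper = f(grades + [1.0] * (num_attrs - len(grades)))
        heapq.heappush(heap, (-upper, obj))
    return results, probes


table = {"a": (0.9, 0.3, 0.2), "b": (0.5, 0.7, 0.3), "c": (0.8, 0.8, 0.75)}
first = {obj: t[0] for obj, t in table.items()}
top1, used = mpro(first, lambda obj, i: table[obj][i], 3, min, 1)
```

On this small table the sketch spends one probe on a and two on c, and never probes b, whose upper bound of 0.5 stays below the final answer’s grade.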


Property of MPro

MPro is optimal in the exact sense (not in the big O sense)
  All probes are necessary
  Assuming we need to compute the complete grades for all returned objects
  Assuming object probing order fixed
  No other algorithm can beat MPro

Does it work for max()?
  Performance depends on the scoring function
  Good only when the upper bound is “tight”


Other Issues for MPro

How to select the attribute probing order
  MPro is optimal given a particular probing order
  Attribute probing order affects performance significantly
  Probe order estimation from sampling

How to parallelize MPro
  Probe top k objects simultaneously


Summary

Efficient processing of ranked queries
  Sorted access
  Random access

FA: k common objects
TA: threshold value
NRA: upper and lower bounds
MPro: necessary probe principle


Hints on Paper Writing

The goal of a paper is to be read and used by other people
  Should be easy to understand
  Tricky balance

How to make a paper easy to read?
  Explicitly specify your assumptions
    Readers do not know what you think!
  Use examples
  Run experiments
