Post on 22-Dec-2015
CS246
Ranked Queries
Junghoo "John" Cho (UCLA Computer Science) 2
Traditional Database Query
(Dept = “CS”) & (GPA > 3.5)
Boolean semantics
  Clear boundary between “answers” and “non-answers”
Goal: Find all “matching” tuples
  Optionally ordered by a certain field
[Figure: answer set A inside the set of all tuples T, with a clear boundary]
Ranked Queries
Find “cheap” houses “close” to UCLA
  Cheap(x) & NearUCLA(x)
Non-Boolean semantics
  No clear boundary between “answers” and “non-answers”
  Answers inherently ranked
Goal: Find top-ranked tuples
[Figure: answer set A inside the set of all tuples T, with no clear boundary]
Issues?
How to rank?
  Distance 3 miles: proximity?
  Similarity: looks like “Tom Cruise”?
How to combine rankings?
  Price = 0.8, Distance = 0.2. Overall grade?
Weighting?
  Price is twice as “important” as distance?
Query processing?
  Traditional query processing is based on Boolean semantics
Fagin’s paper
Previously, all four issues were a “black art”
  No disciplined way to address the problems
Fagin’s paper studied the last three issues in a more “disciplined” way
  Combination of ranks
  Weighting
  Query processing
Find general “properties” and derive a formula satisfying the properties
Topics
Combining multiple grades
Weighting
Efficient query processing
Rank Combination
Cheap(x) & NearUCLA(x)
  Cheap(x) = 0.3
  NearUCLA(x) = 0.8
Overall ranking?
How would you approach the problem?
General Query
(Cheap(x) & (NearUCLA(x) | NearBeach(x))) & RedRoof(x)
How to compute the overall grade?
[Figure: expression tree for the query, with leaf grades Cheap = 0.3, NearUCLA = 0.2, NearBeach = 0.8, RedRoof = 0.6, combined by |, &, &]
Main Idea
Atomic scoring function A(x): given by the application
  Cheap(x) = 0.3, NearUCLA(x) = 0.2, …
Query: recursive application of AND and OR
  (Cheap & (NearUCLA | NearBeach)) & RedRoof
Combination of two grades for “AND” and “OR”
  Binary function $t: [0, 1]^2 \to [0, 1]$
  Example: min(a, b) for “AND”?
    (Cheap & NearUCLA)(x) = min(0.3, 0.2) = 0.2
Properties of AND/OR scoring function?
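As a minimal sketch of this recursive evaluation, the Python below grades a query tree for a single object. It assumes min for "&" (the candidate named above) and max for "|" (an assumption at this point in the slides). The tuple encoding of queries and the names `grade`, `house`, and `query` are hypothetical; the NearBeach and RedRoof grades are the ones shown on the General Query slide.

```python
from typing import Mapping, Union

# A query is either an atomic predicate name or a tuple (op, left, right).
Query = Union[str, tuple]

def grade(query: Query, atomic: Mapping[str, float]) -> float:
    """Recursively grade a query for one object.

    Assumption for illustration: "&" is scored with min and "|" with max.
    """
    if isinstance(query, str):
        return atomic[query]              # atomic grade supplied by the application
    op, left, right = query
    a, b = grade(left, atomic), grade(right, atomic)
    if op == "&":
        return min(a, b)
    if op == "|":
        return max(a, b)
    raise ValueError(f"unknown operator: {op!r}")

# (Cheap & (NearUCLA | NearBeach)) & RedRoof for one house
house = {"Cheap": 0.3, "NearUCLA": 0.2, "NearBeach": 0.8, "RedRoof": 0.6}
query = ("&", ("&", "Cheap", ("|", "NearUCLA", "NearBeach")), "RedRoof")
overall = grade(query, house)             # min(min(0.3, max(0.2, 0.8)), 0.6) = 0.3
```

Under these assumptions the overall grade is 0.3: the OR picks NearBeach (0.8), and Cheap (0.3) then bounds both ANDs.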
Properties of Scoring Function
Logical equivalence
  The same overall score for logically equivalent queries
  (A & (B | C))(x) = ((A & B) | (A & C))(x)
Monotonicity
  If A(x1) < A(x2) and B(x1) < B(x2), then (A & B)(x1) < (A & B)(x2)
  $t(x_1, x_2) < t(x'_1, x'_2)$ if $x_i < x'_i$ for all $i$
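The logical-equivalence property can be checked numerically on a small grid of grades. The sketch below compares A & (B | C) with (A & B) | (A & C) for two candidate pairs of scoring functions: min/max, and a hypothetical alternative (product for AND, probabilistic sum for OR) that is not taken from the slides and serves only as a contrast. The helper name `and_or_agree` is an assumption.

```python
from itertools import product

def and_or_agree(t_and, t_or, grades):
    """True if A&(B|C) and (A&B)|(A&C) score the same for every grade triple."""
    return all(
        t_and(a, t_or(b, c)) == t_or(t_and(a, b), t_and(a, c))
        for a, b, c in product(grades, repeat=3)
    )

# Multiples of 1/4 are exact in binary floating point, so == is safe here.
grid = [i / 4 for i in range(5)]

# min for AND, max for OR
min_max_ok = and_or_agree(min, max, grid)

# Hypothetical alternative: product for AND, probabilistic sum for OR
def prod_and(a, b):
    return a * b

def prob_or(a, b):
    return a + b - a * b

product_ok = and_or_agree(prod_and, prob_or, grid)
```

On this grid min/max passes the check while the product-based pair fails it (for example at a = b = c = 0.5, where the two sides are 0.375 and 0.4375).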
Uniqueness Theorem
min() and max() are the only scoring functions with the two properties
  min() for “AND” and max() for “OR”
Quite surprising and interesting result
  More discussion later
Does logical equivalence really hold?
Question on Logical Equivalence?
Query: Homepage of “John Grisham”
  PageRank & John & Grisham
  Logically equivalent, but are they the same?
Does logical equivalence hold for non-Boolean queries?
[Figure: two expression trees over PR, John, and Grisham, each built from two & operators]
Summary of Scoring Function
Question: how to combine rankings
Scoring function: combine grades
Results from fuzzy logic
  Logical equivalence
  Monotonicity
Uniqueness theorem
  min() for “AND” and max() for “OR”
Logical equivalence may not be valid for graded Boolean expressions
Topics
Combining multiple grades
Weighting
Efficient query processing
Weighting of Grades
Cheap(x) & NearUCLA(x)
What if proximity is “more important” than price?
Assign weights to each atomic query
  Cheap(x) = 0.2, weight = 1
  NearUCLA(x) = 0.8, weight = 10
  Proximity is 10 times more important than price
Overall grade?
Formalization
m atomic queries
$\Theta = (\theta_1, \ldots, \theta_m)$: weight of each atomic query
$X = (x_1, \ldots, x_m)$: grades from each atomic query
$f(x_1, \ldots, x_m)$: unweighted scoring function
$f_\Theta(x_1, \ldots, x_m)$: new weighted scoring function
What should $f_\Theta(x_1, \ldots, x_m)$ be, given $\Theta$?
Properties of $f_\Theta(x_1, \ldots, x_m)$?
Properties
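P1: When all weights are equal, the weighted function reduces to the unweighted one
  $f_{(1/m, \ldots, 1/m)}(x_1, \ldots, x_m) = f(x_1, \ldots, x_m)$
P2: If an argument has zero weight, we can safely drop the argument
  $f_{(\theta_1, \ldots, \theta_{m-1}, 0)}(x_1, \ldots, x_m) = f_{(\theta_1, \ldots, \theta_{m-1})}(x_1, \ldots, x_{m-1})$
P3: $f_\Theta(X)$ should be locally linear
  $f_{\alpha\Theta + (1-\alpha)\Theta'}(x_1, \ldots, x_m) = \alpha f_\Theta(x_1, \ldots, x_m) + (1-\alpha) f_{\Theta'}(x_1, \ldots, x_m)$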
Local Linearity Example
$\Theta_1 = (1/2, 1/2)$, $f_{\Theta_1}(X) = 0.2$
$\Theta_2 = (1/4, 3/4)$, $f_{\Theta_2}(X) = 0.4$
If $\Theta_3 = (3/8, 5/8) = \frac{1}{2}\Theta_1 + \frac{1}{2}\Theta_2$, then
  $f_{\Theta_3}(X) = \frac{1}{2} f_{\Theta_1}(X) + \frac{1}{2} f_{\Theta_2}(X) = 0.3$
Q: m atomic queries. How many independent weight assignments?
  A: m. Only m degrees of freedom
Very strong assumption
  Not too unreasonable, but no rationale
Theorem
$1\cdot(\theta_1 - \theta_2)\, f(x_1) + 2\cdot(\theta_2 - \theta_3)\, f(x_1, x_2) + 3\cdot(\theta_3 - \theta_4)\, f(x_1, x_2, x_3) + \cdots + m\cdot\theta_m\cdot f(x_1, \ldots, x_m)$
is the only function that satisfies these properties
Examples
$\Theta = (1/3, 1/3, 1/3)$
  $1\cdot(1/3 - 1/3)\, f(x_1) + 2\cdot(1/3 - 1/3)\, f(x_1, x_2) + 3\cdot(1/3)\, f(x_1, x_2, x_3)$
  $= f(x_1, x_2, x_3)$
$\Theta = (1/2, 1/4, 1/4)$
  $1\cdot(1/2 - 1/4)\, f(x_1) + 2\cdot(1/4 - 1/4)\, f(x_1, x_2) + 3\cdot(1/4)\, f(x_1, x_2, x_3)$
  $= \frac{1}{4} f(x_1) + \frac{3}{4} f(x_1, x_2, x_3)$
$\Theta = (1/2, 1/3, 1/6)$
  $1\cdot(1/2 - 1/3)\, f(x_1) + 2\cdot(1/3 - 1/6)\, f(x_1, x_2) + 3\cdot(1/6)\, f(x_1, x_2, x_3)$
  $= \frac{1}{6} f(x_1) + \frac{2}{6} f(x_1, x_2) + \frac{3}{6} f(x_1, x_2, x_3)$
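The coefficients in these examples can be reproduced with a short Python sketch of the weighted formula. It assumes, as in all three examples, that the weights are listed in non-increasing order and sum to 1, and that the unweighted function f accepts a list of any length (min is used here). The names `coefficients` and `weighted_score` and the sample grades are hypothetical; exact fractions avoid rounding noise.

```python
from fractions import Fraction as F

def coefficients(weights):
    """Coefficient i * (theta_i - theta_{i+1}) of f(x_1..x_i), with theta_{m+1} = 0."""
    padded = list(weights) + [F(0)]
    return [(i + 1) * (padded[i] - padded[i + 1]) for i in range(len(weights))]

def weighted_score(f, weights, grades):
    """Weighted version of scoring function f.

    Assumes weights are in non-increasing order and that grades are listed
    in the same order as their weights.
    """
    if any(weights[i] < weights[i + 1] for i in range(len(weights) - 1)):
        raise ValueError("weights must be in non-increasing order")
    return sum(c * f(grades[: i + 1]) for i, c in enumerate(coefficients(weights)))

coeffs = coefficients([F(1, 2), F(1, 3), F(1, 6)])   # third example: 1/6, 2/6, 3/6
x = [F(3, 10), F(8, 10), F(6, 10)]                   # hypothetical grades
equal = weighted_score(min, [F(1, 3)] * 3, x)        # equal weights reduce to min(x)
skewed = weighted_score(min, [F(1, 2), F(1, 4), F(1, 4)], x)
```

With equal weights only the last term survives, so the result is the plain min of the grades, which is property P1.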
Summary of Weighting
Question: different “importance” of grades
$\Theta = (\theta_1, \ldots, \theta_m)$: weight assignment
Uniqueness theorem
  Local linearity and two other reasonable assumptions
  $1\cdot(\theta_1 - \theta_2)\, f(x_1) + 2\cdot(\theta_2 - \theta_3)\, f(x_1, x_2) + \cdots + m\cdot\theta_m\cdot f(x_1, \ldots, x_m)$
Linearity assumption questionable
Application?
Web page ranking
  PageRank & (Keyword1 & Keyword2 & …)
Should we use min()?
  min(keyword1, keyword2, keyword3, …)
  Would it be better than the cosine measure?
If PageRank is 10 times more important, should we use Fagin’s formula?
  9/11 PR + 2/11 min(PR, min(keywords))
  Would it be better than other ranking functions?
Is Fagin’s formula practical?
Topics
Combining multiple grades
Weighting
Efficient query processing
Question
How can we process ranked queries efficiently?
  Top k answers for “Cheap(x) & NearUCLA(x)”
  Assume we have good scoring functions
How do we process a traditional Boolean query?
  GPA > 3.5 & Dept = “CS”
What’s the difference?
  What is difficult compared to a Boolean query?
Naïve Solution
Cheap(x) & NearUCLA(x)
1. Read prices of all houses
2. Compute distances of all houses
3. Compute combined grades of all houses
4. Return the k highest-grade objects
Clearly very expensive when the database is large
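For reference, the naïve solution can be sketched in a few lines of Python. The grade tables are hypothetical stand-ins for "read prices" and "compute distances" (the numbers are borrowed from the FA example later in the deck), min is assumed as the combining function, and `naive_top_k` is an illustrative name.

```python
import heapq

def naive_top_k(objects, grade_fns, combine, k):
    """Grade every object on every atomic query, then keep the k best."""
    scored = ((combine([g(o) for g in grade_fns]), o) for o in objects)
    return heapq.nlargest(k, scored)

# Hypothetical atomic grades for four houses
cheap = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
near_ucla = {"a": 0.85, "b": 0.5, "c": 0.2, "d": 0.9}

top2 = naive_top_k(cheap.keys(), [cheap.get, near_ucla.get], min, 2)
```

Every object is graded on every atomic query, so the cost grows linearly with the database size regardless of k.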
Main Idea
We don’t have to check all objects/tuples
  Most tuples have low grades and will not be returned
Basic algorithm
  Check top objects from each atomic query and find the best objects
Question: How many objects should we see from each “atomic query”?
Architecture
Stream 1: a: 0.9, b: 0.8, c: 0.7, …
Stream 2: d: 0.9, a: 0.85, b: 0.78, …
Stream 3: b: 0.9, d: 0.9, a: 0.75, …
f(x1, x2, x3): any monotonic function
Output: b: 0.78, a: 0.75
Sorted access, random access
How many to check? How to minimize it?
Three Papers
Fuzzy queries
Optimal aggregation
Minimal probing
Fagin’s Model
Stream 1 (sorted access): a: 0.9, b: 0.8, c: 0.7, …
Stream 2 (sorted access): d: 0.9, a: 0.85, b: 0.78, …
Stream 3 (sorted access): b: 0.9, d: 0.9, a: 0.75, …
f(x1, x2, x3)
Fagin’s Model
Sorted access on all streams
Cost model: # objects accessed by sorted/random accesses
  $c_s \cdot s + c_r \cdot r$
Ignore the cost for “sorting”
  Reasonable when objects have been sorted already
    Sorted index
  Inappropriate when objects have not been sorted
    We have to compute grades for all objects
    Sorting can be costly
Main Question
How many objects to access?
When can we stop?
  A: When we know that we have seen at least k objects whose scores are higher than those of any unseen object
Fagin’s First Algorithm
Read objects from each stream in parallel
Stop when k objects have been seen in common from all streams
Top answers should be in the union of the objects that we have seen
Why?
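f(x1, x2, x3)
Stream 1: a: 0.9, b: 0.8, c: 0.7, …
Stream 2: d: 0.9, a: 0.85, b: 0.78, …
Stream 3: b: 0.9, d: 0.9, a: 0.75, …
[Figure: the streams are read in parallel until k objects have appeared in all of them]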
Stopping Condition
Reason
  The grades of the k objects in the intersection are at least as high as those of any unseen object
Proof
  x: object in the intersection, y: unseen object
  $y_1 \le x_1$. Similarly, $y_i \le x_i$ for all $i$
  $f(y_1, \ldots, y_m) \le f(x_1, \ldots, x_m)$ due to monotonicity
Fagin’s First Algorithm
1. Get objects from each stream in parallel until we have seen k objects in common from all streams
2. For all objects that we have seen so far
   If its complete grade is not known, obtain unknown grades by random access
3. Find the k objects with the highest grades
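The three steps can be sketched in Python as follows. This is an illustrative sketch, not the paper's pseudocode: it assumes every object appears in every stream, that all streams have the same length, and it models random access as a dictionary lookup. The function name `fagin_fa` is hypothetical, and the data is the Example (k = 2) from the slides with the elided stream entries assumed absent.

```python
import heapq

def fagin_fa(streams, combine, k):
    """streams: one list of (object, grade) per atomic query, sorted by grade descending."""
    m = len(streams)
    lookup = [dict(s) for s in streams]       # stands in for random access
    seen = [set() for _ in range(m)]          # objects seen by sorted access, per stream
    depth = 0
    # Step 1: sorted access in parallel until k objects were seen in all streams
    while len(set.intersection(*seen)) < k and depth < len(streams[0]):
        for i in range(m):
            seen[i].add(streams[i][depth][0])
        depth += 1
    # Step 2: random access for the missing grades of every seen object
    candidates = set.union(*seen)
    random_accesses = sum(o not in seen[i] for o in candidates for i in range(m))
    scored = [(combine([lookup[i][o] for i in range(m)]), o) for o in candidates]
    # Step 3: keep the k best
    return heapq.nlargest(k, scored), m * depth, random_accesses

s1 = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
s2 = [("d", 0.9), ("a", 0.85), ("b", 0.5), ("c", 0.2)]
best, sorted_accesses, random_accesses = fagin_fa([s1, s2], min, 2)
```

On this data the loop stops after three rounds (a and b are then common to both streams), and two random accesses fill in the missing grades of d and c.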
Example (k = 2)
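Stream 1 (x1): a: 0.9, b: 0.8, c: 0.7, … (d: 0.6 by random access)
Stream 2 (x2): d: 0.9, a: 0.85, b: 0.5, … (c: 0.2 by random access)
Scoring function: min(x1, x2)
Object  x1    x2    min
a       0.9   0.85  0.85
d       0.6   0.9   0.6
b       0.8   0.5   0.5
c       0.7   0.2   0.2
Result: a: 0.85, d: 0.6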
Performance
We only look at a subset of objects
  Ignoring the high cost of random access, clearly better than the naïve solution
Total number of accesses
  $O(N^{(m-1)/m} k^{1/m})$, assuming independent and random object order for each atomic query
  E.g., $O(N^{1/2} k^{1/2})$ if m = 2
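To get a feel for the bound, the sketch below evaluates the expression inside the big-O for one hypothetical setting (N = 1,000,000 objects, k = 10, m = 2). The constant hidden by the big-O is ignored, so the number is only an order-of-magnitude indication; `fa_access_estimate` is an illustrative name.

```python
def fa_access_estimate(n: int, k: int, m: int) -> float:
    """N^((m-1)/m) * k^(1/m), ignoring the constant hidden in the big-O."""
    return n ** ((m - 1) / m) * k ** (1 / m)

est = fa_access_estimate(1_000_000, 10, 2)   # sqrt(N * k), roughly 3,162
```

That is a few thousand accesses instead of the million grade computations per atomic query that the naïve solution needs.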
Summary of Fagin’s Algorithm
Sorted access on all streams
Stopping condition
  k common objects from all streams
Problem of Fagin’s Algorithm
Performance depends heavily on object orders in the streams
k = 1, min(x1, x2)
  Stream 1: b: 1, a: 1, c: 1, d: 0, e: 0
  Stream 2: e: 1, d: 1, b: 1, c: 0, a: 0
We need to read all objects
  Sorted access until the 3rd object and random access for all remaining grades
Can we avoid this pathological scenario?
New Idea
Let us read all grades of an object once we see it from a sorted access
  Do not need to wait until the streams give k common objects
  Less dependent on the object order
When can we stop?
  Until we have seen k common objects from sorted accesses?
When Can We Stop?
If we are sure that we have seen at least k objects whose grades are higher than those of unseen objects
How do we know the grades of unseen objects?
Can we predict the maximum grade of unseen objects?
Maximum Grade of Unseen Objects
Assuming min(x1, x2), what will be the maximum grade of unseen objects?
  Stream 1: a: 1, b: 0.9, c: 0.8, d: 0.7, e: 0.6
  Stream 2: e: 1, d: 0.8, b: 0.7, c: 0.7, a: 0.2
After three sorted accesses on each stream, x1 ≤ 0.8 and x2 ≤ 0.7, so at most min(0.8, 0.7) = 0.7
Generalization?
Generalization
$\underline{x}_i$: the minimum grade from stream i by sorted access
$f(\underline{x}_1, \ldots, \underline{x}_m)$ is the maximum grade of unseen objects
  $x_i \le \underline{x}_i$ for all unseen objects
  $f(x_1, \ldots, x_m)$: monotonic
Basic Idea of TA
We can stop when the top k seen object grades are at least as high as the maximum grade of unseen objects
  Maximum grade of unseen objects: $f(\underline{x}_1, \ldots, \underline{x}_m)$
Threshold Algorithm
1. Read one object from each stream by sorted access
2. For each object O that we just read
   Get all grades for O by random access
   If f(O) is in the top k, store it in a buffer
3. If the lowest grade of the top k objects is at least the threshold $f(\underline{x}_1, \ldots, \underline{x}_m)$, stop
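A compact Python sketch of these steps follows. It is illustrative only: random access is modelled as a dictionary lookup, every object is assumed to appear in every stream, and the stopping test uses "at least the threshold" so that ties count. The `seen` set is a simplification that lets the sketch skip objects it has already graded; `threshold_algorithm` is a hypothetical name.

```python
import heapq

def threshold_algorithm(streams, combine, k):
    """streams: one list of (object, grade) per atomic query, sorted by grade descending."""
    m = len(streams)
    lookup = [dict(s) for s in streams]      # stands in for random access
    seen = set()                             # kept only to skip repeats in this sketch
    top = []                                 # min-heap of (score, object), size <= k
    rounds = 0
    for depth in range(len(streams[0])):
        rounds += 1
        last = []                            # lowest grade read so far from each stream
        for i in range(m):
            obj, g = streams[i][depth]       # sorted access
            last.append(g)
            if obj in seen:
                continue
            seen.add(obj)
            score = combine([lookup[j][obj] for j in range(m)])  # random accesses
            if len(top) < k:
                heapq.heappush(top, (score, obj))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, obj))
        threshold = combine(last)            # best possible grade of an unseen object
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), rounds

s1 = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
s2 = [("d", 0.9), ("a", 0.85), ("b", 0.5), ("c", 0.2)]
best, rounds = threshold_algorithm([s1, s2], min, 2)
```

With min and k = 2 the thresholds after each round are 0.9, 0.8, and 0.5, so the loop stops in round 3, when the second-best grade (0.6, object d) reaches the threshold.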
Example (k = 2)
Stream 1 (x1): a: 0.9, b: 0.8, c: 0.7, … (d: 0.6)
Stream 2 (x2): d: 0.9, a: 0.85, b: 0.5, … (c: 0.2)
Scoring function: min(x1, x2)
Object  x1    x2    min
a       0.9   0.85  0.85
d       0.6   0.9   0.6
b       0.8   0.5   0.5
c       0.7   0.2   0.2
Threshold: f(1, 1) = 1 initially, then f(0.9, 0.9) = 0.9, f(0.8, 0.85) = 0.8, f(0.7, 0.5) = 0.5
Result: a: 0.85, d: 0.6
Comparison of FA and TA?
TA sees no more objects than FA
  TA never stops later than FA
    When we have seen k objects in common, their grades are at least the threshold
TA may perform more random accesses than FA
  In TA, (m-1) random accesses for each object
  In FA, random accesses are done at the end, only for missing grades
TA requires bounded buffer space (k)
  At the expense of more random seeks
Comparison of FA and TA
TA can be better in general, but it may perform more random seeks
What if random seek is very expensive or impossible?
  Is an algorithm with no random seek possible?
Algorithm NRA
An algorithm with no random seek
Isn’t random seek essential?
How can we know the grade of an object when some of its grades are missing?
Basic Idea
We may still compute the lower bound of an object’s grade, even if we miss some of its grades
  E.g., $\max(0.6, x) \ge 0.6$
We may also compute the upper bound of an object’s grade, even if we miss some of its grades
  E.g., $\max(0.6, x) \le 0.8$ if $x \le 0.8$
If the lower bound of O1 is higher than the upper bound of other objects, we can return O1
Generalization
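$(\underline{x}_1, \ldots, \underline{x}_m)$: the minimum grades from sorted access
Lower bound of object: 0 for missing grades
  When $x_3$, $x_4$ are missing, $f(x_1, x_2, 0, 0)$
  From monotonicity
Upper bound of object: $\underline{x}_i$ for missing grades
  When $x_3$, $x_4$ are missing, $f(x_1, x_2, \underline{x}_3, \underline{x}_4)$
  $x_3 \le \underline{x}_3$, $x_4 \le \underline{x}_4$, thus $f(x_1, x_2, x_3, x_4) \le f(x_1, x_2, \underline{x}_3, \underline{x}_4)$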
NRA Algorithm
1. Read one object from each stream by sorted access. Assume $(\underline{x}_1, \ldots, \underline{x}_m)$ are the lowest grades from the streams
2. For each object O seen so far, update its upper/lower bounds by
   Upper bound = use $\underline{x}_i$ for missing grades
   Lower bound = use 0 for missing grades
3. If lower bounds of the top k objects are larger than upper bounds of any other object, stop
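The steps above can be sketched in Python as shown below. This is a sketch under stated assumptions: streams have equal length, the stopping test lets a tie between the k-th lower bound and another object's upper bound count as "safe to stop", and objects that were never seen are covered by the bound built from the last grade read on each stream. Grades are `Decimal` values so the bounds compare exactly; `nra` and `avg` are hypothetical names, and the data is the Example (k = 2) with AVG from the slides.

```python
from decimal import Decimal as D

def nra(streams, combine, k):
    """Top-k with sorted access only. Returns (objects, rounds); grades may stay partial."""
    m = len(streams)
    known = {}                    # object -> {stream index: grade seen by sorted access}
    top, rounds = [], 0
    for depth in range(len(streams[0])):
        rounds += 1
        last = []                 # lowest grade read so far from each stream
        for i in range(m):
            obj, g = streams[i][depth]
            known.setdefault(obj, {})[i] = g
            last.append(g)
        # Lower bound: 0 for missing grades. Upper bound: last-read grade for missing grades.
        lower = {o: combine([gs.get(i, 0) for i in range(m)]) for o, gs in known.items()}
        upper = {o: combine([gs.get(i, last[i]) for i in range(m)]) for o, gs in known.items()}
        top = sorted(known, key=lambda o: lower[o], reverse=True)[:k]
        others = [upper[o] for o in known if o not in top]
        others.append(combine(last))          # upper bound for objects not seen at all
        if len(top) == k and lower[top[-1]] >= max(others):
            break
    return top, rounds

def avg(xs):
    return sum(xs) / len(xs)

s1 = [("a", D("0.9")), ("b", D("0.5")), ("c", D("0.3"))]
s2 = [("d", D("0.7")), ("a", D("0.6")), ("e", D("0.2"))]
answers, rounds = nra([s1, s2], avg, 2)
```

On this data the sketch returns a and d after three rounds: d's lower bound is 0.35, and by then the best upper bound among the other objects (b, with AVG(0.5, 0.2) = 0.35) no longer exceeds it. Note that d is returned without its x1 grade ever being read.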
Example (k = 2)
Stream 1 (x1): a: 0.9, b: 0.5, c: 0.3, …
Stream 2 (x2): d: 0.7, a: 0.6, e: 0.2, …
Scoring function: AVG(x1, x2)
Object  x1   x2   Lower bound                  Upper bound
a       0.9  0.6  AVG(0.9,0)=0.45, then 0.75   AVG(0.9,0.7)=0.8, then 0.75
d            0.7  AVG(0,0.7)=0.35              AVG(0.9,0.7)=0.8, then AVG(0.5,0.7)=0.6, then AVG(0.3,0.7)=0.5
b       0.5       AVG(0.5,0)=0.25              AVG(0.5,0.6)=0.55, then AVG(0.5,0.2)=0.35
c       0.3       AVG(0.3,0)=0.15              AVG(0.3,0.2)=0.25
e            0.2  AVG(0,0.2)=0.1               AVG(0.3,0.2)=0.25
Unseen objects: upper bound AVG(0.9,0.7) = 0.8, then AVG(0.5,0.6) = 0.55, then AVG(0.3,0.2) = 0.25
Result: a, d
Properties of NRA
No random access
We may return an object even if we don’t know its grade
  We may only know its lower bound
We need to constantly update the upper bounds of objects
  As the threshold value decreases
Chang’s View
Computing grades can be expensive
Sorting is expensive
Minimize sorted access
Chang’s Model
Sorted access on one stream and random access on the remaining streams
  At least one sorted access necessary to “discover” objects
Cost model: # of random accesses
Reasonable when the objects are not sorted for some streams
Chang’s Model
Stream 1: a: 0.9, b: 0.8, c: 0.7, …
Stream 2: d: 0.9, a: 0.85, b: 0.78, …
Stream 3: b: 0.9, d: 0.9, a: 0.75, …
f(x1, x2, x3)
Sorted access on one stream, random access on the others
Chang’s Solution
Main idea?
  Probe only necessary attributes
  A probe is necessary iff we cannot find the right answer without it
Which probe is necessary?
Necessary Probes
Assume attribute probe order is fixed
Assume min() is the scoring function
Assume the threshold (grade of the kth highest object) is 0.7
a: (0.9, 0.3, 0.2)
  Is the second probe necessary?
b: (0.5, 0.7, 0.3)
  Is the second probe necessary?
Is the necessity dependent on the algorithm?
Observation
Probe necessity is independent of the algorithm
  Purely dependent on the dataset
  Assuming probe order is fixed
How do we find the necessary probes?
  When the upper bound of the grade of object O goes below the threshold, no more probes are necessary for the object
How do we find the threshold value?
  We know the upper bound of the threshold
  Threshold upper bound = kth upper bound of grades ($g_k$)
[Figure: objects ordered by upper bound of grade, from $g_1$ down to the kth upper bound $g_k$]
Algorithm MPro
As long as we probe objects with grades above the threshold upper bound, we are safe
Q: a priority queue for the upper bound grades of objects
  Pick the top object O from Q
  Probe the next attribute of O
  Stop if we have the complete grade for the top k objects in Q
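A small Python sketch of this loop is given below. It is one possible reading of the slide, not the paper's pseudocode: the first attribute of every object is assumed known from the sorted stream, later attributes are fetched by hypothetical probe functions in a fixed order, and an unprobed grade is bounded above by 1. Objects a and b are the ones from the Necessary Probes slide; object c is invented so that the top grade is 0.7. The name `mpro` is illustrative.

```python
import heapq

def mpro(first_grades, probes, combine, k):
    """first_grades: {object: x1}; probes: functions object -> grade for attributes 2..m."""
    m = 1 + len(probes)

    def upper(gs):
        # Unprobed grades can be at most 1, so pad with 1.0 to get an upper bound.
        return combine(gs + [1.0] * (m - len(gs)))

    # Max-heap on the upper bound, via negated keys.
    heap = [(-upper([g]), obj, [g]) for obj, g in first_grades.items()]
    heapq.heapify(heap)
    results, probe_log = [], []
    while heap and len(results) < k:
        neg_ub, obj, gs = heapq.heappop(heap)
        if len(gs) == m:                      # complete grade and still on top: an answer
            results.append((-neg_ub, obj))
            continue
        gs = gs + [probes[len(gs) - 1](obj)]  # probe the next attribute of the top object
        probe_log.append((obj, len(gs)))
        heapq.heappush(heap, (-upper(gs), obj, gs))
    return results, probe_log

table = {"a": (0.9, 0.3, 0.2), "b": (0.5, 0.7, 0.3), "c": (0.8, 0.7, 0.9)}
first = {o: g[0] for o, g in table.items()}
probes = [lambda o: table[o][1], lambda o: table[o][2]]
results, probe_log = mpro(first, probes, min, 1)
```

With min and k = 1, the sketch probes the second attribute of a (its upper bound 0.9 is above the final answer's 0.7), then both remaining attributes of c, and never probes b, whose upper bound of 0.5 stays below 0.7.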
Property of MPro
MPro is optimal in the exact sense (not in the big-O sense)
  All probes are necessary
    Assuming we need to compute the complete grades for all returned objects
    Assuming object probing order is fixed
  No other algorithm can beat MPro
Does it work for max()?
Performance depends on the scoring function
  Good only when the upper bound is “tight”
Other Issues for MPro
How to select the attribute probing order
  MPro is optimal given a particular probing order
  Attribute probing order affects performance significantly
  Probe order estimation from sampling
How to parallelize MPro
  Probe top k objects simultaneously
Summary
Efficient processing of ranked queries
  Sorted access
  Random access
FA: k common objects
TA: threshold value
NRA: upper and lower bounds
MPro: necessary probe principle
Hints on Paper Writing
The goal of a paper is to be read and used by other people
  Should be easy to understand
  Tricky balance
How to make a paper easy to read?
  Explicitly specify your assumptions
    Readers do not know what you think!
  Use examples
  Run experiments