top-k aggregation using intersections - stanford cs...
TRANSCRIPT
![Page 1: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/1.jpg)
Top-k Aggregation Using Intersections
Ravi Kumar
Kunal Punera
Torsten Suel
Sergei Vassilvitskii
Yahoo! Research
Yahoo! Research
Yahoo! Research / Brooklyn Poly
Yahoo! Research
![Page 2: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/2.jpg)
Top-k retrieval
Given a set of documents:
And a query: “New York City”
Find the k documents best matching the query.
2
Doc 1
Doc 2Doc 4
Doc 3Doc 5
Doc 6
![Page 3: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/3.jpg)
Top-k retrieval
Given a set of documents:
And a query: “New York City”
Find the k documents best matching the query.
Assume: decomposable scoring function:
Score(“New York City”) = Score(“New”) + Score(“York”)+Score(“City”).
3
Doc 1
Doc 2Doc 4
Doc 3Doc 5
Doc 6
![Page 4: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/4.jpg)
Introduction: Postings Lists
Data Structures behind top-k retrieval.
Create posting lists:
4
Doc ID Score
![Page 5: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/5.jpg)
Introduction: Postings Lists
Data Structures behind top-k retrieval.
Create posting lists:
Query: New York City
New...
York...
City...
5
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
Doc ID Score
![Page 6: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/6.jpg)
Introduction: Postings Lists
(Offline) Sort each list by decreasing score.
Query: New York City
New...
York...
City...
Retrieval: Start with document with highest score in any list.
Look up its score in other lists.
Top:
6
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 5.2+3.1+0.2=8.5
![Page 7: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/7.jpg)
Introduction: Postings Lists
Data Structures behind top-k retrieval:
Arrange each list by decreasing score.
Query: New York City
New...
York...
City...
Continue with next highest score.
Top: Candidate:
7
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.5 10 4.1+2.0+0.0 = 6.1
![Page 8: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/8.jpg)
Introduction: Postings Lists
Data Structures behind top-k retrieval:
Arrange each list by decreasing score.
Query: New York City
New...
York...
City...
Continue with next highest score.
Top: Candidate:
7
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.5 10 4.1+2.0+0.0 = 6.1
![Page 9: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/9.jpg)
Introduction: Postings Lists
Data Structures behind top-k retrieval:
Arrange each list by decreasing score.
Query: New York City
New...
York...
City...
Continue with next highest score.
Top: Candidate:
8
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.5 5 4.0+0.5+0.1=4.6
![Page 10: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/10.jpg)
Introduction: Postings Lists
Data Structures behind top-k retrieval:
Arrange each list by decreasing score.
Query: New York City
New...
York...
City...
Continue with next highest score.
Top: Candidate:
8
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.5 5 4.0+0.5+0.1=4.6
![Page 11: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/11.jpg)
Introduction: Postings Lists
Data Structures behind top-k retrieval:
Arrange each list by decreasing score.
Query: New York City
New...
York...
City...
When can we stop?
Top: Best Possible Remaining:
9
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.5 * 3.3+1.5+1.0=5.8
![Page 12: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/12.jpg)
Introduction: Postings Lists
Data Structures behind top-k retrieval:
Arrange each list by decreasing score.
Query: New York City
New...
York...
City...
When can we stop?
Top: Best Possible Remaining:
9
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.5 * 3.3+1.5+1.0=5.8
![Page 13: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/13.jpg)
Threshold Algorithm
10
Threshold Algorithm (TA)
– Instance optimal (in # of accesses) [Fagin et al]
– Performs random accesses
No-Random-Access Algorithm (NRA)
– Similar to TA
– Keep a list of all seen results
– Also instance optimal
![Page 14: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/14.jpg)
Introducing bi-grams
11
![Page 15: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/15.jpg)
Introducing bi-grams
Certain words often occur as phrases. Word association:
11
![Page 16: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/16.jpg)
Introducing bi-grams
Certain words often occur as phrases. Word association:
– Sagrada ...
11
![Page 17: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/17.jpg)
Introducing bi-grams
Certain words often occur as phrases. Word association:
– Sagrada ...
– Barack ...
11
![Page 18: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/18.jpg)
Introducing bi-grams
Certain words often occur as phrases. Word association:
– Sagrada ...
– Barack ...
– Latent Semantic...
11
![Page 19: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/19.jpg)
Introducing bi-grams
Certain words often occur as phrases. Word association:
– Sagrada ...
– Barack ...
– Latent Semantic...
Pre-compute posting lists for intersections
– Note, this is not query-result caching
Tradeoffs:
– Space: extra space to store the intersection (though it’s smaller)
– Time: Less time upon retrieval
12
![Page 20: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/20.jpg)
Bi-grams & TA
Query: New York City
All aggregations -- 6 lists.
[New] [York] [City] [New York] [New City] [York City]
13
![Page 21: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/21.jpg)
Bi-grams & TA
Query: New York City
All aggregations -- 6 lists.
[New] [York] [City] [New York] [New City] [York City]
New
York
City
NY
NC
YC
14
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.3
9 5.4
10 6.1
5 4.5
7 4.3
9 3.3
7 4.3
5 4.1
7 2.0
10 4.1
3 2.5
3 1.5
3 1.0
10 2.0
5 0.6
![Page 22: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/22.jpg)
Bi-grams & TA
Query: New York City
All aggregations -- 6 lists.
[New] [York] [City] [New York] [New City] [York City]
New
York
City
NY
NC
YC
Top:
15
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.3
9 5.4
10 6.1
5 4.5
7 4.3
9 3.3
7 4.3
5 4.1
7 2.0
10 4.1
3 2.5
3 1.5
3 1.0
10 2.0
5 0.6
9 8.5
![Page 23: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/23.jpg)
Bi-grams & TA
Query: New York City
All aggregations -- 6 lists.
[New] [York] [City] [New York] [New City] [York City]
New
York
City
NY
NC
YC
Top:
16
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.3
9 5.4
10 6.1
5 4.5
7 4.3
9 3.3
7 4.3
5 4.1
7 2.0
10 4.1
3 2.5
3 1.5
3 1.0
10 2.0
5 0.6
9 8.5 Can we stop now?
![Page 24: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/24.jpg)
TA Bounds Informal
New
York
City
NY
NC
YC
Top:
Bounds on any unseen element:
N + Y + C = 10.1
17
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.3
9 5.4
10 6.1
5 4.5
7 4.3
9 3.3
7 4.3
5 4.1
7 2.0
10 4.1
3 2.5
3 1.5
3 1.0
10 2.0
5 0.6
9 8.5
![Page 25: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/25.jpg)
TA Bounds Informal
New
York
City
NY
NC
YC
Top:
Bounds on any unseen element:
N + Y + C = 10.1 NY + C = 6.5
18
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.3
9 5.4
10 6.1
5 4.5
7 4.3
9 3.3
7 4.3
5 4.1
7 2.0
10 4.1
3 2.5
3 1.5
3 1.0
10 2.0
5 0.6
9 8.5
![Page 26: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/26.jpg)
TA Bounds Informal
New
York
City
NY
NC
YC
Top:
Bounds on any unseen element:
N + Y + C = 10.1 NY + C = 6.5 NC + Y = 8.4 YC + N = 10.1
19
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.3
9 5.4
10 6.1
5 4.5
7 4.3
9 3.3
7 4.3
5 4.1
7 2.0
10 4.1
3 2.5
3 1.5
3 1.0
10 2.0
5 0.6
9 8.5
![Page 27: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/27.jpg)
TA Bounds Informal
New
York
City
NY
NC
YC
Top:
Bounds on any unseen element:
N + Y + C = 10.1 NY + C = 6.5 NC + Y = 8.4 YC + N = 10.1 1/2 (NY + YC + NC) = 7.45
20
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.3
9 5.4
10 6.1
5 4.5
7 4.3
9 3.3
7 4.3
5 4.1
7 2.0
10 4.1
3 2.5
3 1.5
3 1.0
10 2.0
5 0.6
9 8.5
![Page 28: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/28.jpg)
TA Bounds Informal
21
9 5.2
10 4.1
10 2.0
5 4.0
9 3.1
3 1.5
7 3.3
7 1.0
7 1.0
3 1.0
5 0.5
9 0.2
10 0.0
1 0.2
5 0.1
9 8.3
9 5.4
10 6.1
5 4.5
7 4.3
9 3.3
7 4.3
5 4.1
7 2.0
10 4.1
3 1.5
3 1.0
5 0.6
New
York
City
NY
NC
YC
Top:
Bounds on any unseen element:
N + Y + C = 10.1 NY + C = 6.5 NC + Y = 8.4 YC + N = 10.1 1/2 (NY + YC + NC) = 7.45
Thus best element has score < 6.5. So we are done!
9 8.5
3 2.5 10 2.0
![Page 29: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/29.jpg)
TA: Bounds Formal
Can we write the bounds on the next element?
: score of document x in list i.
: bound on the score in list i (score of next unseen document)
Combinations: bound on
Simple LP for bound on unseen elements:
In theory: Easy! Just solve an LP every time.
In reality: You’re kidding, right?
22
xi
bi
bij xi + xj
max
!
i
xi
xi ! bi
xi + xj ! bij
![Page 30: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/30.jpg)
Solving the LP
Need to solve the LP: Same as solving the dual
23
max
!
i
xi
xi ! bi
xi + xj ! bij
min!
yijbij +!
yibi
yi +!
j
yij ! 1
yi, yij ! 0
![Page 31: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/31.jpg)
The dual as a graph
24
min!
yijbij +!
yibi
yi +!
j
yij ! 1
yi, yij ! 0
Add one node for each with weight yi
Add one edge for each with weight yij
5.2
5.1
3.31.2
3.7
6.1
bij
bi
1.2
5.4
3.3
4.2
![Page 32: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/32.jpg)
The dual as a graph
24
min!
yijbij +!
yibi
yi +!
j
yij ! 1
yi, yij ! 0
Add one node for each with weight yi
Add one edge for each with weight yij
5.2
5.1
3.31.2
3.7
6.1
bij
bi
1.2
5.4
3.3
4.2Single Lists
![Page 33: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/33.jpg)
The dual as a graph
24
min!
yijbij +!
yibi
yi +!
j
yij ! 1
yi, yij ! 0
Add one node for each with weight yi
Add one edge for each with weight yij
5.2
5.1
3.31.2
3.7
6.1
bij
bi
1.2
5.4
3.3
4.2
Paired Lists
![Page 34: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/34.jpg)
The dual as a graph
24
min!
yijbij +!
yibi
yi +!
j
yij ! 1
yi, yij ! 0
Add one node for each with weight yi
Add one edge for each with weight yij
Goal: select a (fractional) subset of edges and vertices, so that each vertex has (in total) a weight of 1 selected.
5.2
5.1
3.31.2
3.7
6.1
bij
bi
1.2
5.4
3.3
4.2
![Page 35: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/35.jpg)
The dual as a graph
25
min!
yijbij +!
yibi
yi +!
j
yij ! 1
yi, yij ! 0
Add one node for each with weight yi
Add one edge for each with weight yij
Goal: select a (fractional) subset of edges and vertices, so that each vertex has (in total) a weight of 1 selected.
5.2
5.1
3.31.2
3.7
6.1
bij
bi
1.2
5.4
3.3
4.2
![Page 36: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/36.jpg)
Solving the problem...
Goal: select a subset of edges and vertices, so that each vertex has a weight of 1 selected.
This looks like the classical edge cover problem only with vertices.
26
5.2
5.1
3.31.2
3.7
6.1
1.2
5.4
3.3
4.2
![Page 37: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/37.jpg)
Solving the problem...
Goal: select a subset of edges and vertices, so that each vertex has a weight of 1 selected.
This looks like the classical edge cover problem only with vertices.
We show how to solve this problem by computing min cost matching.
Running time: O(nm)
Checking all combinations: O(n!)
27
5.2
5.1
3.31.2
3.7
6.1
1.2
5.4
3.3
4.2
![Page 38: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/38.jpg)
Outline
Introduction to TA
Solving the ‘upper bound’ problem
Empirical Results
Conclusion
28
![Page 39: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/39.jpg)
Empirical Analysis
Datasets:
– Trec (25M pages), 100k queries
– Yahoo! (16M pages), 10k queries (random subset in each)• result caching enabled
Metrics:
– Number of Random and Sequential Accesses
– Index size
Which bigrams to select?
– Query oblivious manner
– Greedily based on size of intersection versus size of original lists
29
![Page 40: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/40.jpg)
Empirical Results
Baseline: traverse full list
INT: Use intersection lists, but still no Early Termination
ET: Use early termination, but without intersection lists
ET + INT: Use both early termination & intersection lists
Total index growth: 25%
30
0
15,000
30,000
45,000
60,000
Number of sequential accesses vs. Algorithm
Baseline INT ET ET + INT
Accesses
![Page 41: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/41.jpg)
Empirical Results (2)
Immediate benefit, but diminishing returns as extra intersections added.
31
0
4,500
9,000
13,500
18,000
Number of sequential accesses vs. Index size
0% 25% 50% 100%Index size increase
Accesses
![Page 42: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/42.jpg)
Results (2)
We prove that in worst case we must examine all of the lists to find the bound. (Otherwise not instance-optimal)
But is this just a theoretical result?
What if you use a simpler heuristics that focus only on intersection lists?
– For 89% of the queries:• Average savings 4500 random accesses
– For the 11% of the remaining queries• Average cost 127,000 random accesses
So the worst case does occur in practice.
32
![Page 43: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/43.jpg)
Conclusions
Give a formal analysis of how to use pre-aggregated posting lists
– Solving an LP is unreasonable
Show empirically that a simple selection rule for intersections gives performance improvements.
Many questions remain:
– Extending results to tri-grams (Solving hyperedge cover)
– Better ways of selecting intersections
– ...
33
![Page 44: Top-k Aggregation Using Intersections - Stanford CS Theorytheory.stanford.edu/~sergei/slides/wsdm09-topk.pdfRavi Kumar Kunal Punera Torsten Suel Sergei Vassilvitskii Yahoo! Research](https://reader034.vdocuments.site/reader034/viewer/2022042417/5f33325e4929cd27ed0d1533/html5/thumbnails/44.jpg)
Thank you