When Is Nearest Neighbors Indexable?
Uri Shaft (Oracle Corp.), Raghu Ramakrishnan (UW-Madison)
Motivation - Scalability Experiments
• Dozens of papers describe experiments about index scalability with increased dimensions.
  – Constants are:
    • Number of data points
    • Data and query distribution
    • Index structure / search algorithm
  – Variable:
    • Number of dimensions
  – Measurement:
    • Performance of index
Example From PODS 1997
Motivation
• In many cases the conclusion is that the empirical evidence suggests the index structures do scale with dimensionality
• We would like to investigate these claims mathematically – supply a proof of scalability or non-scalability.
Historical Context
• Continues work done in “When Is Nearest Neighbors Meaningful?” (ICDT 1999)
• Previous work about behavior of distance distributions.
• This work about behavior of indexing structures under similar conditions.
Contents
• Vanishing Variance property
• Convex Description index structures
• Indexing Theorem
  – The performance of CD index does not scale for VV workloads using Euclidean distances.
• Conclusion
• Future Work
Vanishing Variance
• Same definition used in ICDT 99 work (although not named in that work)
• In 1999 we showed that the workloads become meaningless – the relative differences between the distances from the query to various data points become arbitrarily small (the ratios of those distances approach 1).
• We use the same result here.
Vanishing Variance
• A scalability experiment contains a series of workloads W1, W2, …, Wm, …
  – m is the number of dimensions
  – each workload Wm has n data points and a query point (same distribution)
  – Distance distribution marked as Dm
• Vanishing Variance:
$$\lim_{m \to \infty} \operatorname{var}\left(\frac{D_m}{E(D_m)}\right) = 0$$
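The Vanishing Variance condition can be probed numerically. The sketch below is only an illustration under assumptions that are not on the slides: data and query points are i.i.d. uniform in the unit cube, the metric is Euclidean, and the function name `relative_variance` and the sample size are hypothetical choices. It estimates var(Dm / E(Dm)) from sampled query/data pairs.

```python
import numpy as np

def relative_variance(m, n=2000, seed=0):
    """Estimate var(D_m / E[D_m]) for i.i.d. uniform points in [0, 1]^m.

    Assumption (not from the slides): uniform data, Euclidean metric.
    """
    rng = np.random.default_rng(seed)
    queries = rng.random((n, m))   # n sampled query points
    data = rng.random((n, m))      # n sampled data points, same distribution
    dists = np.linalg.norm(data - queries, axis=1)  # one sample of D_m per pair
    return float(np.var(dists / dists.mean()))

if __name__ == "__main__":
    # The estimate should shrink as the number of dimensions grows.
    for m in (2, 10, 100, 1000):
        print(m, relative_variance(m))
```

For this particular workload the estimate falls roughly in proportion to 1/m, which is the behavior the definition asks for; other distributions may or may not satisfy it.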
Convex Description Index
• Data points distributed to buckets (e.g. disk pages). Access to a bucket is “all or nothing”. We allow redundancy. A bucket contains at least two data points.
• Each bucket associated with a description – a convex region containing all data points in the bucket.
• Search algorithm accesses at least all buckets whose convex region is closer than the nearest neighbor.
• Cost of search is the number of data points retrieved.
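The cost model above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: it uses axis-aligned bounding boxes as the convex descriptions (as an R-Tree would), computes the nearest-neighbor distance by brute force, and counts the points in every bucket whose box is at least as close as that distance. The names `box_distance` and `cd_search_cost` are made up for this example.

```python
import numpy as np

def box_distance(query, lo, hi):
    """Euclidean distance from query to the axis-aligned box [lo, hi] (0 if inside)."""
    gap = np.maximum(lo - query, 0.0) + np.maximum(query - hi, 0.0)
    return float(np.linalg.norm(gap))

def cd_search_cost(query, buckets):
    """Return (nn_distance, cost) for a list of buckets (arrays of points).

    cost counts the data points in every bucket whose bounding box is no
    farther from the query than the nearest neighbor. Using <= (rather than <)
    guarantees that the bucket holding the nearest neighbor is counted.
    """
    all_points = np.vstack(buckets)
    nn_dist = float(np.linalg.norm(all_points - query, axis=1).min())
    cost = 0
    for pts in buckets:
        lo, hi = pts.min(axis=0), pts.max(axis=0)  # bounding box = description
        if box_distance(query, lo, hi) <= nn_dist:
            cost += len(pts)
    return nn_dist, cost

if __name__ == "__main__":
    far = [np.array([[0.0, 0.0], [1.0, 1.0]]), np.array([[10.0, 10.0], [11.0, 11.0]])]
    print(cd_search_cost(np.array([0.5, 0.4]), far))
```

Because access is "all or nothing", a bucket contributes all of its points to the cost even when none of them is the nearest neighbor.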
Example: R-Tree
• Buckets are disk pages. Under normal construction buckets contain more than two data points each.
• Bucket descriptions are convex and contain all data points (Bounding Rectangles).
• Search algorithm accesses all buckets whose convex region is closer than the nearest neighbor (and probably a few more).
Convex Description Indexes
• All R-Tree variants
• X-Tree
• M-Tree
• kdb-Tree
• SS-Tree and SR-Tree
• Many more
Other indexes (non-CD)
• Probability structures (P-Tree, VLDB 2000)
  – Access based on clusters. A near enough bucket may not be accessed.
• Projection index (like VA-file)
  – Compression structures.
  – All data points accessed in pieces, not in buckets.
Indexing Theorem
• If:
  – Scalability experiment uses a series of workloads with Vanishing Variance
  – The distance metric is Euclidean
  – The indexing structure is Convex Description
• Then:
  – The expected cost of a query converges to the number of data points
  – I.e., a linear scan of the data
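A small experiment can make the theorem's claim tangible. The sketch below is an illustration under stated assumptions, not a reproduction of the authors' experiments: data and query are i.i.d. uniform in the unit cube, each bucket holds exactly two points, and the bucket's convex description is taken to be the segment joining them. Any convex region containing both points also contains that segment, so a real bucket description can only be closer to the query. The function names are hypothetical.

```python
import numpy as np

def segment_distance(q, a, b):
    """Euclidean distance from point q to the segment [a, b]."""
    ab = b - a
    denom = float(ab @ ab)
    # Project q onto the line through a and b, then clamp to the segment.
    t = 0.0 if denom == 0.0 else float(np.clip((q - a) @ ab / denom, 0.0, 1.0))
    return float(np.linalg.norm(q - (a + t * ab)))

def fraction_accessed(m, n=200, seed=0):
    """Fraction of two-point buckets at least as close to the query as its nearest neighbor.

    Assumptions (not from the slides): uniform data in [0, 1]^m, n even,
    consecutive points paired into buckets.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n, m))
    query = rng.random(m)
    nn_dist = float(np.linalg.norm(data - query, axis=1).min())
    pairs = data.reshape(n // 2, 2, m)  # each row of pairs is one bucket
    hits = sum(segment_distance(query, a, b) <= nn_dist for a, b in pairs)
    return float(hits) / len(pairs)

if __name__ == "__main__":
    for m in (2, 10, 100, 1000):
        print(m, fraction_accessed(m))
```

In low dimensions only a few buckets qualify; as m grows the fraction approaches 1, meaning the search degenerates toward a linear scan for this workload.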
Sketch of Proof
• Because of Vanishing Variance, the ratio of distances between various query and data points becomes arbitrarily close to 1.
• When using Euclidean distance, we can look at an arbitrary data bucket and a query point, choose two data points from the bucket and create a triangle:
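[Figure: triangle formed by the query point Q and two data points D1 and D2 from the same bucket; Y is a point on the segment D1D2, inside the bucket.]
• Distances from Q to D1, D2, …, Dn are about the same.
• Distance from Q to Y is much smaller.
• Therefore, the distance from Q to the data bucket is less than the distance to the nearest neighbor.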
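The triangle argument can be made quantitative under one extra assumption that is not stated on the slide: take Y to be the midpoint of D1D2. Write r for the common approximate distance from Q to the data points. Because D1 and D2 are drawn from the same distribution as Q, the distance between D1 and D2 is also approximately r. The median-length formula then gives:

```latex
\begin{aligned}
|QY|^2 &= \frac{|QD_1|^2 + |QD_2|^2}{2} - \frac{|D_1 D_2|^2}{4} \\
       &\approx \frac{r^2 + r^2}{2} - \frac{r^2}{4} = \frac{3}{4}\, r^2, \\
|QY|   &\approx \frac{\sqrt{3}}{2}\, r \approx 0.87\, r < r .
\end{aligned}
```

Since the bucket description is convex and contains D1 and D2, it also contains Y, so the distance from Q to the bucket is at most |QY|. The nearest neighbor is at distance approximately r, so the bucket must be accessed.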
Conclusion
• Dozens of papers describe experiments about index scalability with increased dimensions.
• We wanted to investigate these claims mathematically – supply a proof of scalability or non-scalability.
• We proved that the index structures in many of these experiments do not scale with dimensionality.
Conclusion
• Use this theorem to channel indexing research into more useful and practical avenues.
• Review previous results accordingly.
Future Work
• Remove restriction of at least two data points in bucket.
  – Easy exercise, need to take into account the cost of traversing a hierarchical data structure.
• Investigate other Lp metrics
• Investigate projection indexes using Euclidean metric (looks like they do not scale either)
Future Work
• Find scalable indexing structure for Uniform data and L metric
  – Hint: use compression
• Find number of data points needed for R-Tree to be practical on uniform data, L2 metric.
  – Approx: [formula garbled in extraction; surviving symbols: n, F, 3, m]
Questions