![Page 1: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some](https://reader034.vdocuments.site/reader034/viewer/2022042712/5f9238b78b90a33406464c7b/html5/thumbnails/1.jpg)
Indexing inexact proximity search with distance regression in pivot space
Ole Edsberg, Magnus Lie Hetland
Department of Computer and Information Science, Norwegian University of Science and Technology
Similarity Search and Applications, 2010
The problem
Our task: Speed up proximity search in cases where:
- Distance calculation is expensive.
- Distance-based indexing is needed, because the contents of the data objects cannot be used in the index.
- Some search inexactness is acceptable, meaning we are allowed to trade some search accuracy for reduced search computation cost.

Our contribution: A new indexing scheme that in some cases provides better computation/accuracy trade-offs than the competition. It also has some drawbacks.
Related work
E. Chávez, K. Figueroa, and G. Navarro. Effective proximity retrieval by ordering permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1647–1658, 2008.

Takeaway:
- How to perform inexact search by ordering the database according to a promise value function.
- A specific promise value function, which we use as a baseline in our experiments.
- Experimental setup.
Pivot space
Pivot set: $P = (p_1, p_2, \ldots, p_m)$

Mapping an object $o$ to pivot space:

$$\Phi(o) = (d(o,p_1), d(o,p_2), \ldots, d(o,p_m))$$
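As a minimal sketch of this mapping, the function below computes the pivot-space vector for one object. The names `pivot_map` and `abs_diff` are hypothetical, and the absolute difference on numbers is only a stand-in for whatever expensive distance function the application actually uses.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def pivot_map(o: T, pivots: Sequence[T], d: Callable[[T, T], float]) -> List[float]:
    """Return Phi(o) = (d(o, p_1), ..., d(o, p_m))."""
    return [d(o, p) for p in pivots]


def abs_diff(a: float, b: float) -> float:
    """Illustrative metric only (assumption): absolute difference on numbers."""
    return abs(a - b)


pivots = [0.0, 5.0, 10.0]
phi = pivot_map(3.0, pivots, abs_diff)
```

Mapping a query costs m distance calculations, one per pivot.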
The baseline: Permutation based promise values
The promise value for indexed object u with respect to query q is the correlation (rank correlation coefficient, Spearman’s ρ) between the ordering permutation of Φ(u) and that of Φ(q).
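One way to compute such a promise value is sketched below. It assumes the textbook formula for Spearman's ρ without tie correction (ties between pivot distances are broken by pivot index), which may differ in detail from the baseline implementation. The function names are hypothetical.

```python
def ranks(phi):
    """rank[i] = position of pivot i when the pivots are sorted by distance."""
    # The sorted index list is the ordering permutation; sorted() is stable,
    # so tied distances keep their pivot-index order.
    order = sorted(range(len(phi)), key=lambda i: phi[i])
    rank = [0] * len(phi)
    for position, pivot_index in enumerate(order):
        rank[pivot_index] = position
    return rank


def spearman_rho(phi_u, phi_q):
    """Spearman's rho between the pivot orderings of u and q (needs m >= 2)."""
    m = len(phi_u)
    sum_sq = sum((a - b) ** 2 for a, b in zip(ranks(phi_u), ranks(phi_q)))
    return 1.0 - 6.0 * sum_sq / (m * (m * m - 1))
```

Objects whose pivot ordering agrees with the query's get values near 1 and would be visited first. Only the permutation of each indexed object needs to be stored, not the distances themselves.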
What makes a good promise value function?
Two ideas for promise value functions
Distance estimate: $\hat{d}(u,q)$
Probability of inclusion: $\Pr(d(u,q) \le r)$
Uncertainty in distance estimates
[Figure: probability density (y-axis) over distance to query (x-axis), with the estimate $\hat{d}(u,q)$ marked.]
Should red or blue object be visited first?
[Figure: probability densities (y-axis) over distance to query (x-axis) for a red and a blue object, with the radius r marked.]
Linear model of distance for indexed object u
$$d(u,q) = \beta_{\langle u,0\rangle} + \sum_{i=1}^{m} \beta_{\langle u,i\rangle}\, d(q,p_i) + \varepsilon_u$$
Regression-based index (one model per object!)
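With n objects to index and m pivots, the index is an n × (m + 1) matrix:

$$\begin{pmatrix}
\hat{\beta}_{\langle u_1,0\rangle} & \hat{\beta}_{\langle u_1,1\rangle} & \cdots & \hat{\beta}_{\langle u_1,m\rangle} \\
\hat{\beta}_{\langle u_2,0\rangle} & \hat{\beta}_{\langle u_2,1\rangle} & \cdots & \hat{\beta}_{\langle u_2,m\rangle} \\
\vdots & \vdots & & \vdots \\
\hat{\beta}_{\langle u_n,0\rangle} & \hat{\beta}_{\langle u_n,1\rangle} & \cdots & \hat{\beta}_{\langle u_n,m\rangle}
\end{pmatrix}$$

- The coefficients can be discretized to save space.
- Plus 2n + 2 additional values if we want probabilities.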
Building the index
1. Select $n'$ training queries from the objects to be indexed.
2. For each training query $q'$, calculate $\Phi(q')$.
3. For each object to be indexed $u$:
   3.1 For each training query $q'$, calculate $d(u,q')$.
   3.2 Solve the least squares linear regression problem to find the $m + 1$ coefficients $\beta_u$.
   3.3 Store the coefficients in the index.
   3.4 If we want probabilities, also store $\hat{\sigma}_u$, the estimated standard deviation of $\varepsilon_u$:

   $$\hat{\sigma}_u = \sqrt{\frac{\sum_{i=1}^{n'} \left(d(u,q'_i) - \hat{d}(u,q'_i)\right)^2}{n' - m - 1}}$$

4. If we want probabilities for the k-NN queries, also store the estimated search radius for each k.

(Detail glossed over in this presentation: we exclude u from the training queries used to fit its own model.)
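The build procedure can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: `build_regression_index` is a hypothetical name, the sketch does not exclude u from its own training queries, it skips the per-k search radius of step 4, and it stores coefficients as plain floats without discretization.

```python
import numpy as np


def build_regression_index(objects, pivots, training_queries, d):
    """Fit one least-squares linear model per indexed object.

    Returns (B, sigma): B has shape n x (m + 1), row u = (beta_0, ..., beta_m);
    sigma[u] is the estimated standard deviation of the residual for u.
    """
    m = len(pivots)
    n_train = len(training_queries)
    if n_train <= m + 1:
        raise ValueError("need more than m + 1 training queries")

    # Step 2: pivot-space vectors of the training queries, with a leading
    # column of ones for the intercept term.
    X = np.ones((n_train, m + 1))
    for j, q in enumerate(training_queries):
        X[j, 1:] = [d(q, p) for p in pivots]

    B = np.empty((len(objects), m + 1))
    sigma = np.empty(len(objects))
    for k, u in enumerate(objects):
        # Step 3.1: true distances from u to every training query.
        y = np.array([d(u, q) for q in training_queries], dtype=float)
        # Step 3.2: ordinary least squares fit.
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Steps 3.3 and 3.4: store coefficients and residual std estimate.
        B[k] = beta
        residuals = y - X @ beta
        sigma[k] = np.sqrt(residuals @ residuals / (n_train - m - 1))
    return B, sigma
```

The design matrix `X` is shared by all objects, so only the response vector `y` changes from one regression to the next.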
Distance estimates as promise values
$$\hat{d}(u,q) = \hat{\beta}_{\langle u,0\rangle} + \sum_{i=1}^{m} \hat{\beta}_{\langle u,i\rangle}\, d(q,p_i)$$
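At query time, the estimate for each indexed object is a dot product between its stored coefficients and the query's pivot-space vector. The sketch below ranks objects by increasing estimated distance, on the assumption that a smaller estimate means a more promising object. The function names are hypothetical.

```python
def estimate_distance(beta_u, phi_q):
    """d_hat(u, q) = beta_0 + sum_i beta_i * d(q, p_i)."""
    return beta_u[0] + sum(b * x for b, x in zip(beta_u[1:], phi_q))


def visiting_order(index_rows, phi_q):
    """Indices of the indexed objects, most promising (smallest estimate) first."""
    estimates = [estimate_distance(row, phi_q) for row in index_rows]
    return sorted(range(len(index_rows)), key=lambda u: estimates[u])
```

The search then computes true distances for objects in this order and stops when its computation budget is spent.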
Probability-based promise values
$$\frac{r - \hat{d}(u,q)}{\hat{\sigma}_u}$$

- Depends on a lot of assumptions.
- We also ignore the consequence of excluding u from its own training queries.
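To illustrate how this ratio relates to a probability of inclusion, the sketch below assumes that the model error is normally distributed with mean zero and standard deviation σ̂_u. That normality assumption is ours for illustration. Under it, Pr(d(u,q) ≤ r) is the standard normal CDF of the ratio, and because the CDF is monotonic, ranking by the ratio gives the same order as ranking by the probability.

```python
import math


def z_score(r, d_hat, sigma_u):
    """The ratio (r - d_hat) / sigma_u; sigma_u must be positive."""
    return (r - d_hat) / sigma_u


def inclusion_probability(r, d_hat, sigma_u):
    """Pr(d(u,q) <= r), assuming a normal error with std sigma_u (assumption)."""
    return 0.5 * (1.0 + math.erf(z_score(r, d_hat, sigma_u) / math.sqrt(2.0)))


# An object with a larger distance estimate can still be the better bet
# if its estimate is much less certain (hypothetical numbers).
confident = inclusion_probability(r=1.0, d_hat=2.0, sigma_u=0.2)
uncertain = inclusion_probability(r=1.0, d_hat=2.5, sigma_u=1.5)
```

Here `uncertain` exceeds `confident`, so the probability-based ordering visits the second object first even though its estimated distance is larger.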
Storage costs
With n objects to index and m pivots:
- For distance estimates: $n(m + 1)$ coefficients. (Can be discretized at the cost of some accuracy.)
- For probabilities: $2n + 2$ additional values.
- Permutation-based index: $nm\lceil \log_2 m \rceil$ bits in total.
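The storage formulas can be compared directly once a bit width per discretized coefficient is fixed. The sketch below is our own arithmetic under that assumption. In particular, `fair_pivot_count` is a hypothetical helper that finds the largest reduced pivot count whose coefficient storage fits in the permutation index's per-object budget; it may not match how the reduced pivot sets were actually chosen, and it ignores the 2n + 2 extra values needed for probabilities.

```python
import math


def permutation_index_bits(n, m):
    """n * m * ceil(log2(m)) bits for the permutation-based index."""
    return n * m * math.ceil(math.log2(m))


def regression_index_bits(n, m, bits_per_coefficient):
    """n * (m + 1) coefficients, each discretized to the given bit width."""
    return n * (m + 1) * bits_per_coefficient


def fair_pivot_count(m, bits_per_coefficient):
    """Largest m' such that (m' + 1) * bits <= m * ceil(log2(m))."""
    budget = m * math.ceil(math.log2(m))  # bits per object in the permutation index
    return budget // bits_per_coefficient - 1
```

For example, with m = 128 pivots the permutation index uses 128 × 7 = 896 bits per object, which under this calculation leaves room for 55 pivots at 16 bits per coefficient.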
Index building costs
With n objects to index, m pivots and n′ training queries:
- Regression-based scheme: n′(n + m) distance calculations, plus the solution of n linear regression problems.
- Permutation-based scheme: nm distance calculations, plus some sorting.
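As a worked example with hypothetical sizes chosen only for illustration (n = 10 000 objects, m = 128 pivots, n′ = 1 000 training queries; these are not the sizes used in the experiments):

$$
\begin{aligned}
\text{regression-based: } & n'(n + m) = 1000 \times (10000 + 128) = 10\,128\,000 \\
\text{permutation-based: } & nm = 10000 \times 128 = 1\,280\,000
\end{aligned}
$$

With these numbers the regression-based scheme needs about 7.9 times as many distance calculations, before counting the n regression solves.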
Experimental setup
- We borrowed the experimental setup from Chávez, Figueroa & Navarro’s evaluation of the permutation-based scheme.
- Pivots selected randomly.
- Also evaluated versions with pivot set reduced to make storage cost equal to permutation-based index.
- Both synthetic and real-world data sets, but results on real-world data may have more validity.
Evaluating promise value functions: computation / accuracy trade-offs
[Figure: search accuracy (%) versus computation cost (%), averaged over many queries. Curves: perfect orderings, hypothetical A, hypothetical B, random orderings.]
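A trade-off curve of this kind can be computed for a single query as sketched below. The sketch assumes that computation cost is the share of indexed objects whose true distance has been calculated, and that accuracy is the share of the exact result set found so far. Averaging over queries is left out, and `tradeoff_curve` is a hypothetical name.

```python
def tradeoff_curve(visiting_order, relevant):
    """Return (cost %, accuracy %) pairs, one per visited object.

    visiting_order: object ids, most promising first.
    relevant: non-empty set of ids in the exact query result.
    """
    n = len(visiting_order)
    found = 0
    curve = []
    for visited, obj in enumerate(visiting_order, start=1):
        if obj in relevant:
            found += 1
        curve.append((100.0 * visited / n, 100.0 * found / len(relevant)))
    return curve
```

A perfect ordering places every relevant object first, so its curve reaches 100% accuracy at the lowest possible cost. A random ordering is expected to gain accuracy in proportion to cost.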
Results on normalized edit distance (NED)
[Figure: % retrieved versus % of max distance calculations (logarithmic x-axis, 10⁻¹ to 10²). Panels: (a) 32 pivots, (b) 128 pivots. Curves: Permutations; Distance estimates (16 bits unfair); Probabilities (16 bits unfair); Distance estimates (16 bits fair); Distance estimates (12 bits fair).]
Results on documents (TREC)
[Figure: Panel (a), range queries: % retrieved versus % of max distance calculations (logarithmic x-axis). Panel (b), 5-NN queries: % retrieved versus wall clock time (seconds). Curves: Permutations; Distance estimates (16 bits unfair); Probabilities (16 bits unfair); Distance estimates (16 bits fair); Distance estimates (12 bits fair).]
Results on face images (FERET)
[Figure: Panel (a): % of max distance calculations versus k. Panel (b): wall clock time (seconds, on the order of 10⁻³) versus k. Curves: Permutations; Distance estimates (16 bits unfair); Probabilities (16 bits unfair); Distance estimates (16 bits fair); Distance estimates (12 bits fair).]
Why were the probability-based promise values sometimes worse, and never better, than the distance estimates?
Conclusion
The regression-based scheme shows some promise, but:
- It takes a lot of time to build the index.
- It is vulnerable to deviations from its assumptions.