Similarity Search on Bregman Divergence, Towards Non-Metric Indexing
Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung
Metric vs. Non-Metric
Euclidean distance dominates DB queries
Similarity in human perception
Metric distance is not enough!
23/4/18 Similarity Search on Bregman Divergence: Towards Non-Metric Indexing 2
Outline
Bregman Divergence
Solution
Basic solution
Better pruning bounds
Query distribution
Experiments
Conclusion
Bregman Divergence
[Figure: convex function f(x) with the points (p, f(p)) and (q, f(q)) and a line h; marked are the Euclidean distance between q and p and the Bregman divergence Df(p,q)]
Bregman Divergence
Mathematical Interpretation
The distance between p and q is defined as the difference between f(p) and the first-order Taylor expansion of f at q, evaluated at p
[Figure legend: original f(x); first-order Taylor expansion of f(x) at q]
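The interpretation above corresponds to the standard formula Df(p,q) = f(p) - f(q) - <grad f(q), p - q>. A minimal sketch follows; the helper names (bregman_divergence, sq_norm, sq_norm_grad) are illustrative and not taken from the slides.

```python
import numpy as np

def bregman_divergence(f, grad_f, p, q):
    """D_f(p, q): f(p) minus the first-order Taylor expansion of f at q, evaluated at p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    taylor_at_q = f(q) + np.dot(grad_f(q), p - q)
    return float(f(p) - taylor_at_q)

def sq_norm(x):
    # f(x) = sum_i x_i^2, the generator of the squared Euclidean distance
    return float(np.dot(x, x))

def sq_norm_grad(x):
    return 2.0 * x

p = np.array([0.0, 1.0])
q = np.array([0.5, 0.5])
d = bregman_divergence(sq_norm, sq_norm_grad, p, q)
print(d)  # equals the squared Euclidean distance between p and q
```

With this choice of f the divergence reduces to the squared Euclidean distance, so the same function covers the traditional case.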
Bregman Divergence
General Properties
Uniqueness
A function f(x) uniquely determines Df(p,q)
Non-Negativity
Df(p,q)≥0 for any p, q
Identity
Df(p,p)=0 for any p
Symmetry and the triangle inequality do NOT hold in general
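The properties above can be checked numerically. The sketch below uses KL-divergence to show asymmetry and squared Euclidean distance to show a triangle-inequality violation; the sample points are chosen only for illustration.

```python
import math

def kl(p, q):
    # KL-divergence between two discrete distributions with positive entries
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def sq_euclid(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

p, q = (0.9, 0.1), (0.5, 0.5)

# Non-negativity and identity hold
nonneg = kl(p, q) >= 0 and kl(q, p) >= 0
identity = kl(p, p) == 0

# Symmetry fails: D(p, q) differs from D(q, p)
asymmetric = not math.isclose(kl(p, q), kl(q, p))

# Triangle inequality fails: D(a, c) > D(a, b) + D(b, c)
a, b, c = (0.0,), (1.0,), (2.0,)
triangle_violated = sq_euclid(a, c) > sq_euclid(a, b) + sq_euclid(b, c)

print(nonneg, identity, asymmetric, triangle_violated)
```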
Examples
Distance                  f(x)               Df(p,q)                          Usage
KL-Divergence             x log x            p log(p/q)                       distribution, color histogram
Itakura-Saito Distance    -log x             p/q - log(p/q) - 1               signal, speech
Squared Euclidean         x^2                (p-q)^2                          traditional queries
Von Neumann Entropy       tr(X log X - X)    tr(X log X - X log Y - X + Y)    symmetric matrix
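The per-coordinate closed forms for the first three examples can be cross-checked against the general definition f(p) - f(q) - f'(q)(p - q), summed over coordinates. This is a sketch with illustrative names. For KL the definition yields an extra -p + q per coordinate, which sums to zero when p and q are probability distributions, as assumed here. The matrix case (Von Neumann entropy) is left out.

```python
import math

def bregman(f, df, p, q):
    # Sum of per-coordinate divergences f(p_i) - f(q_i) - f'(q_i) (p_i - q_i)
    return sum(f(pi) - f(qi) - df(qi) * (pi - qi) for pi, qi in zip(p, q))

# Generator f and its derivative for each example
GENERATORS = {
    "kl": (lambda x: x * math.log(x), lambda x: math.log(x) + 1.0),
    "itakura_saito": (lambda x: -math.log(x), lambda x: -1.0 / x),
    "squared_euclidean": (lambda x: x * x, lambda x: 2.0 * x),
}

# Per-coordinate closed forms as listed in the examples
CLOSED_FORMS = {
    "kl": lambda p, q: p * math.log(p / q),
    "itakura_saito": lambda p, q: p / q - math.log(p / q) - 1.0,
    "squared_euclidean": lambda p, q: (p - q) ** 2,
}

# Two probability distributions (each sums to 1)
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

results = {}
for name, (f, df) in GENERATORS.items():
    from_definition = bregman(f, df, p, q)
    from_closed_form = sum(CLOSED_FORMS[name](pi, qi) for pi, qi in zip(p, q))
    results[name] = (from_definition, from_closed_form)
    print(name, from_definition, from_closed_form)
```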
Why in DB system?
Database application
Retrieval of similar images, speech signals, or time series
Optimization on matrices in machine learning
Efficiency is important!
Query Types
Nearest Neighbor Query
Range Query
Euclidean Space
How to answer the queries
R-Tree
Euclidean Space
How to answer the queries
VA File
Our goal
Reuse the infrastructure of existing DB systems to support Bregman divergence
Storage management
Indexing structures
Query processing algorithms
Basic Solution
Extended Space
Convex function f(x) = x^2
point A1 A2
p 0 1
q 0.5 0.5
r 1 0.8
t 1.5 0.3
point A1 A2 A3
p+ 0 1 1
q+ 0.5 0.5 0.5
r+ 1 0.8 1.64
t+ 1.5 0.3 2.34
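The extension can be sketched as follows, assuming the extra attribute A3 is f summed over the original coordinates, which is the rule that reproduces the A3 values of p+, q+ and r+. The function name extend is illustrative.

```python
def extend(point, f):
    """Append sum_i f(x_i) as one extra coordinate (assumed extension rule)."""
    return tuple(point) + (sum(f(x) for x in point),)

def square(x):
    return x * x

points = {"p": (0.0, 1.0), "q": (0.5, 0.5), "r": (1.0, 0.8), "t": (1.5, 0.3)}
extended = {name + "+": extend(xy, square) for name, xy in points.items()}

for name, xyz in extended.items():
    print(name, [round(v, 2) for v in xyz])
```

Under this rule the third coordinate is 1 for p, 0.5 for q, 1.64 for r, and 2.34 for t = (1.5, 0.3).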
Basic Solution
After the extension
Index extended points with R-Tree or VA File
Re-use existing algorithms with new lower and upper bound computation
How to improve?
Reformulation of Bregman divergence
Tighter bounds are derived
No change to index construction or query processing algorithms
A New Formulation
[Figure: points q and p, lines h and h’, and the query vector vq; marked are Df(p,q)+Δ and D*f(p,q)]
Math. Interpretation
Reformulation of similarity search queries
k-NN query: query q, data set P, divergence Df
Find the point p, minimizing [formula missing]
Range query: query q, threshold θ, data set P
Return any point p that [condition missing]
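One way to read the reformulation, stated here as an assumption rather than the authors' exact definitions: since Df(p,q) = f(p) - <grad f(q), p> - (f(q) - <grad f(q), q>), the last term does not depend on p. Taking Δ = f(q) - <grad f(q), q>, the quantity D*f(p,q) = Df(p,q) + Δ is the inner product of the extended point (p, f(p)) with a query vector vq = (-grad f(q), 1). The sketch below checks this for the KL generator on random distributions; all function names and the threshold value are illustrative.

```python
import numpy as np

def f(x):
    # KL generator summed over coordinates: sum_i x_i log x_i
    return float(np.sum(x * np.log(x)))

def grad_f(x):
    return np.log(x) + 1.0

def bregman(p, q):
    return f(p) - f(q) - float(np.dot(grad_f(q), p - q))

def extend(p):
    # Extended point (p, f(p))
    return np.append(p, f(p))

def query_vector(q):
    # Assumed form of vq: (-grad f(q), 1)
    return np.append(-grad_f(q), 1.0)

def delta(q):
    # Assumed form of the query-only offset: f(q) - <grad f(q), q>
    return f(q) - float(np.dot(grad_f(q), q))

rng = np.random.default_rng(0)
data = rng.dirichlet(np.ones(4), size=50)  # each row is a distribution
q = rng.dirichlet(np.ones(4))

true_div = np.array([bregman(p, q) for p in data])
d_star = np.array([extend(p) for p in data]) @ query_vector(q)

# k-NN: ranking by D* matches ranking by Df
same_order = np.array_equal(np.argsort(true_div), np.argsort(d_star))

# Range query: Df <= theta is the same test as D* <= theta + delta
theta = 0.5
same_range = np.array_equal(true_div <= theta, d_star <= theta + delta(q))
print(same_order, same_range)
```

Because D* is linear in the extended point, a k-NN or range query becomes a query against a hyperplane-like score, which is what makes rectangle-based pruning possible.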
Naïve Bounds
Check the corners of the bounding rectangles
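A sketch of corner checking, assuming the reformulated divergence is the inner product of an extended point with a query vector vq = (-grad f(q), 1) and an offset Δ = f(q) - <grad f(q), q>. These are assumptions about the formulation, and the names are illustrative. A linear function over a rectangle attains its minimum and maximum at corners, and the right corner can be picked one dimension at a time from the sign of each component of vq.

```python
import numpy as np

def extend(points):
    """Append sum_i x_i^2 (f(x) = x^2) as an extra column."""
    pts = np.asarray(points, dtype=float)
    return np.hstack([pts, np.sum(pts ** 2, axis=1, keepdims=True)])

def corner_bounds(lo, hi, v):
    """Min and max of <x, v> over the box lo <= x <= hi."""
    lo, hi, v = (np.asarray(a, dtype=float) for a in (lo, hi, v))
    lower = np.sum(np.where(v >= 0, lo * v, hi * v))
    upper = np.sum(np.where(v >= 0, hi * v, lo * v))
    return float(lower), float(upper)

# Bounding rectangle of two extended points
pts = extend([[0.0, 1.0], [1.0, 0.8]])
lo, hi = pts.min(axis=0), pts.max(axis=0)

q = np.array([0.5, 0.5])
v_q = np.append(-2.0 * q, 1.0)                    # (-grad f(q), 1) for f(x) = x^2
delta = float(np.dot(q, q) - np.dot(2.0 * q, q))  # f(q) - <grad f(q), q>

lower_star, upper_star = corner_bounds(lo, hi, v_q)
lower = max(lower_star - delta, 0.0)  # a divergence is never negative
upper = upper_star - delta

true = np.sum((pts[:, :2] - q) ** 2, axis=1)  # actual divergences to q
print(lower, upper, true)
```

The bounds are valid but loose, because the rectangle treats the extra coordinate as independent of the original ones.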
Tighter Bounds
Take the curve f(x) into consideration
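One way to take the curve into account, given here as a generic convexity argument and not as the bound derived by the authors: when f is a sum of per-coordinate convex terms, each per-coordinate divergence is convex in the data coordinate and is zero at the query coordinate. Over an interval its minimum is at the query coordinate clamped to the interval and its maximum is at an endpoint. The function names are illustrative.

```python
import numpy as np

def per_dim(f, df, x, q):
    # Per-coordinate divergence f(x) - f(q) - f'(q) (x - q)
    return f(x) - f(q) - df(q) * (x - q)

def box_bounds(f, df, lo, hi, q):
    """Exact min and max of D_f(x, q) over the box lo <= x <= hi, for separable f."""
    lo, hi, q = (np.asarray(a, dtype=float) for a in (lo, hi, q))
    nearest = np.clip(q, lo, hi)  # minimizer in each dimension
    lower = np.sum(per_dim(f, df, nearest, q))
    upper = np.sum(np.maximum(per_dim(f, df, lo, q), per_dim(f, df, hi, q)))
    return float(lower), float(upper)

def square(x):
    return x ** 2

def square_grad(x):
    return 2.0 * x

# Bounding rectangle of (0, 1) and (1, 0.8) in the original space
lo, hi = [0.0, 0.8], [1.0, 1.0]
q = [0.5, 0.5]
lower, upper = box_bounds(square, square_grad, lo, hi, q)
print(lower, upper)
```

For this rectangle and query the range is about [0.09, 0.5], and the divergences of (0, 1) and (1, 0.8) to q, 0.5 and 0.34, fall inside it.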
Query distribution
Distortion of rectangles
The difference between maximum and minimum distances from inside the rectangle to the query
Can we improve it more?
When building an R-Tree in Euclidean space
Minimize the volume/edge length of MBRs
Does it remain valid?
Query distribution
Distortion of bounding rectangles
Invariant in Euclidean space (triangle inequality)
Query-dependent for Bregman Divergence
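A small one-dimensional illustration of this contrast, using Itakura-Saito distance as the Bregman example. The interval and the query positions are chosen only for illustration. Distortion is computed as the maximum minus the minimum distance from points in the interval to the query.

```python
import math

def itakura_saito(x, q):
    return x / q - math.log(x / q) - 1.0

def euclidean(x, q):
    return abs(x - q)

def distortion(dist, lo, hi, q):
    """Max minus min of dist(x, q) over x in [lo, hi].

    Assumes dist is convex in x with its minimum at x = q, which holds
    for both distances used here.
    """
    nearest = min(max(q, lo), hi)
    return max(dist(lo, q), dist(hi, q)) - dist(nearest, q)

lo, hi = 1.0, 2.0
queries = [0.01, 0.1, 0.5]  # all to the left of the interval

euclid = [distortion(euclidean, lo, hi, q) for q in queries]
isd = [distortion(itakura_saito, lo, hi, q) for q in queries]

for q, e, d in zip(queries, euclid, isd):
    print(q, round(e, 3), round(d, 3))
```

For these queries the Euclidean distortion stays at the interval length, while the Itakura-Saito distortion of the same interval grows as the query moves toward zero.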
Utilize Query Distribution
Summarize the query distribution with O(d) real numbers
Estimate the expected distortion of any bounding rectangle in O(d) time
Allows better indexes to be constructed for both R-Tree and VA File
Experiments
Data Sets
KDD’99 data
Network data, the proportion of packets in 72 different TCP/IP connection types
DBLP data
Use the co-authorship graph to generate the probabilities of authors being related to 8 different areas
Experiments
Data Sets
Uniform Synthetic data
Generate synthetic data with uniform distribution
Clustered Synthetic data
Generate synthetic data with Gaussian Mixture Model
Experiments
Methods to compare
               Basic    Improved Bounds    Query Distribution
R-Tree         R        R-B                R-BQ
VA File        V        V-B                V-BQ
Linear Scan    LS
BB-Tree        BBT
Experiments
Index Construction Time
Experiments
Varying dimensionality
Experiments
Varying dimensionality (cont.)
Experiments
Varying k for nearest neighbor query
Conclusion
A general technique for similarity search on Bregman divergence
All techniques are based on the existing infrastructure of commercial databases
Extensive experiments compare the performance of R-Tree and VA File under different optimizations
Acknowledgment
Zhenjie Zhang, Anthony K. H. Tung and Beng Chin Ooi were supported by Singapore NRF grant R-252-000-376-279.
Srinivasan Parthasarathy was supported by NSF IIS-0347662 (CAREER) and NSF CCF-0702587.