A Quantitative Analysis and Performance Study For Similar-Search Methods In High-Dimensional Space Presented By Umang Shah Koushik

Upload: aubrey-tucker

Post on 02-Jan-2016

A Quantitative Analysis and Performance Study For Similar-Search Methods In High-Dimensional Space

Presented By Umang Shah Koushik

Introduction

Sequential scan outperforms the index-based methods whenever the dimensionality exceeds about 10.

Any clustering or data-space partitioning method fails to handle high-dimensional vector spaces (HDVSs) beyond a certain dimensionality.

The VA-File is proposed to perform the inevitable sequential scan more efficiently. Its performance improves as the dimensionality increases.

Assumptions and Notation

Assumption 1: Data and Metric

• Data points lie in the unit hypercube [0,1]^d

• Distances are measured with the Euclidean (L2) metric

Assumption 2: Uniformity and Independence

• Data and query points are uniformly distributed

• Dimensions are independent.

NN, NN-distance, NN-sphere
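The three notions named in this slide title can be illustrated with a brute-force sketch. It assumes the usual definitions: the NN is the data point closest to the query, the NN-distance is the distance to that point, and the NN-sphere is the sphere around the query whose radius is the NN-distance. The function names are illustrative only.

```python
import math


def nearest_neighbor(data, query):
    """Return (nn, nn_dist): the closest data point and its Euclidean distance."""
    nn = min(data, key=lambda p: math.dist(p, query))
    return nn, math.dist(nn, query)


def in_nn_sphere(point, query, nn_dist):
    """True if `point` lies inside the sphere of radius nn_dist around the query."""
    return math.dist(point, query) <= nn_dist


data = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
query = (0.6, 0.6)
nn, nn_dist = nearest_neighbor(data, query)
print(nn, nn_dist)
```

By construction, the NN-sphere contains the nearest neighbor on its surface and no data point strictly inside it.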

Probability and Volume Computations

The Difficulties of High Dimensionality

• Number of partitions

• Data space is sparsely populated

• Spherical range queries

• Exponentially growing DB size

• Expected NN-distance

Number of partitions

• 2^d partitions

• Assume N = 10^6 points.

• For d = 100, there are 2^100 ≈ 10^30 partitions.

• Too many partitions are empty.
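The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the occupied fraction is only an upper bound, reached when every point falls into a different partition.

```python
def partition_count(d):
    """Number of partitions when every one of d dimensions is split once."""
    return 2 ** d


def max_occupied_fraction(n_points, d):
    """Upper bound on the fraction of partitions that can hold at least one point."""
    return min(1.0, n_points / partition_count(d))


print(partition_count(100))                 # a 31-digit number, on the order of 10^30
print(max_occupied_fraction(10 ** 6, 100))  # vanishingly small
```

Even in the best case, a million points can occupy only a negligible fraction of the partitions, which is why almost all of them stay empty.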

Data space is sparsely populated

0.95^100 ≈ 0.0059

At d = 100, even a hypercube of side 0.95 covers only 0.59% of the data space.
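The sparsity figure follows from the volume of a hypercube, side^d. The sketch below reproduces it and also inverts the formula to show how long the side must be to cover a given fraction of the unit data space; the function names are illustrative.

```python
def covered_fraction(side, d):
    """Volume of a d-dimensional hypercube with the given side inside the unit cube."""
    return side ** d


def side_for_fraction(fraction, d):
    """Side length needed for a hypercube to cover `fraction` of the unit cube."""
    return fraction ** (1.0 / d)


print(covered_fraction(0.95, 100))   # about 0.0059
print(side_for_fraction(0.5, 100))   # just above 0.99
```

At d = 100, a query cube has to span more than 99% of every dimension before it covers even half of the data space.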

Spherical range queries

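The largest spherical query that fits entirely within the data space is centered in the middle of the hypercube and has radius 0.5.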

Exponentially growing DB size

For at least one point to fall, on average, into the largest possible sphere, the database size must grow exponentially with d.
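A rough calculation shows how quickly this requirement grows. The sketch assumes the largest sphere inside the unit hypercube has radius 0.5 and uses the standard volume formula for a d-dimensional ball; under the uniformity assumption the expected number of points inside the sphere is N times its volume, so roughly 1/volume points are needed to expect one hit. Function names are illustrative.

```python
import math


def sphere_volume(d, radius=0.5):
    """Volume of a d-dimensional ball: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d


def points_needed(d):
    """Database size at which one point is expected inside the largest sphere."""
    return 1.0 / sphere_volume(d)


for d in (2, 10, 40, 100):
    print(d, sphere_volume(d), points_needed(d))
```

For d = 2 the sphere covers about 79% of the space, while for d = 100 its volume is so small that no realistic database would place a point inside it.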

Expected NN-Distance

The NN distance grows steadily with d.
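This growth can be observed empirically with a small Monte Carlo experiment. The sketch below draws uniform data and query points in the unit hypercube, as in Assumption 2, and averages the brute-force NN-distance; the sample sizes and the function name are arbitrary choices for illustration.

```python
import math
import random


def mean_nn_distance(n_points, d, n_queries=20, seed=0):
    """Estimate the expected NN-distance for uniform data in [0,1]^d."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        query = [rng.random() for _ in range(d)]
        total += min(math.dist(p, query) for p in data)
    return total / n_queries


for d in (2, 10, 50):
    print(d, round(mean_nn_distance(500, d), 3))
```

With the database size held fixed, the estimated distance rises with every increase in d.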

General Cost Model

The Probability that the ith block is visited,

Expected number of blocks visited

If we assume m objects per block,

Is M_visit > 20% of all blocks? If so, the method is unlikely to beat a sequential scan.
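The structure of this cost model can be sketched in a few lines. By linearity of expectation, the expected number of visited blocks is the sum of the per-block visit probabilities. The probabilities below are made-up illustrative values, not results from the study, and the function names are hypothetical.

```python
def expected_blocks_visited(p_visit):
    """Expected number of visited blocks, given each block's visit probability."""
    return sum(p_visit)


def exceeds_threshold(p_visit, threshold=0.20):
    """True if the expected share of visited blocks is above the threshold."""
    return expected_blocks_visited(p_visit) / len(p_visit) > threshold


low_dim = [0.1] * 10              # hypothetical: few blocks are likely to be touched
high_dim = [0.9, 0.8, 0.7, 0.9]   # hypothetical: almost every block is touched
print(exceeds_threshold(low_dim), exceeds_threshold(high_dim))
```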

Space-Partitioning Methods

l_max is independent of d.

Splitting every dimension would produce 2^d partitions, so the split is done in d' dimensions only.

E[nn^dist] increases with d.

When E[nn^dist] is greater than l_max, the entire database is accessed.

Data-Partitioning Methods

Rectangular MBRs

• R*-tree, X-tree, SR-tree

Spherical MBRs

• TV-tree, M-tree, SR-tree

General partitioning and Clustering schemes

Rectangular MBRs

Spherical MBRs

General Partitioning and Clustering Schemes

Assumptions

• A cluster is characterized by a geometrical form (MBR) that covers all cluster points.

• Each cluster contains at least 2 points.

• The MBR of a cluster is convex.

Vector Approximation File

Basic Idea:

• Technique specially designed for similarity search

• Object approximation

• Vector data compression

Notations

Lower bound, upper bound
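One way to picture the two bounds: given only the rectangular cell that contains a data point, the distance from the query to the nearest and to the farthest point of that cell brackets the true distance. The following is a minimal sketch under the Euclidean metric; `cell_bounds` and its parameters are illustrative names, not notation from the slides.

```python
import math


def cell_bounds(query, cell_lo, cell_hi):
    """Return (lower, upper) bounds on the distance from query to any point in the cell."""
    lower_sq = 0.0
    upper_sq = 0.0
    for q, lo, hi in zip(query, cell_lo, cell_hi):
        # Distance to the nearest cell face along this dimension (0 if q is inside).
        near = max(lo - q, q - hi, 0.0)
        # Distance to the farthest cell face along this dimension.
        far = max(abs(q - lo), abs(q - hi))
        lower_sq += near * near
        upper_sq += far * far
    return math.sqrt(lower_sq), math.sqrt(upper_sq)


print(cell_bounds((0.0, 0.0), (1.0, 1.0), (2.0, 2.0)))
```

Any point stored in the cell is at least `lower` and at most `upper` away from the query, so the bounds can be evaluated without reading the point itself.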

How it is done

• The data space is divided into 2^b rectangular cells.

• Cells are arranged in the form of a grid.

• The entire file is scanned at query time.

Compression Vector

• For each dimension i, a small number of bits b[i] is assigned.

• The sum of all b[i] is b.

• The data space is divided into 2^b hyper-rectangles.

• Each data point is approximated by the bit string of the cell that contains it.

• Only the boundary points along each dimension need to be stored.

Compression Vector

Normally, the number of bits chosen for each dimension varies from 4 to 8.

Typically: b[i] = l, b = d * l, l = 4..8
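The approximation step can be sketched as a simple quantization. This sketch assumes equal-width slices along every dimension and coordinates in [0, 1], which is a simplification chosen for illustration; the function names are hypothetical. A small l = 2 is used so that the output stays readable.

```python
def approximate(vector, bits_per_dim):
    """Return, per dimension, the index of the grid slice containing the coordinate."""
    slices = 1 << bits_per_dim  # 2^l slices along every dimension
    # Coordinates are assumed to lie in [0, 1]; 1.0 is clamped into the last slice.
    return [min(int(x * slices), slices - 1) for x in vector]


def to_bit_string(indices, bits_per_dim):
    """Concatenate the per-dimension slice indices into one bit string of d * l bits."""
    return "".join(format(i, "0{}b".format(bits_per_dim)) for i in indices)


def cell_box(indices, bits_per_dim):
    """Recover the lower and upper corner of the cell from its slice indices."""
    slices = 1 << bits_per_dim
    lower = [i / slices for i in indices]
    upper = [(i + 1) / slices for i in indices]
    return lower, upper


point = [0.1, 0.9]
cell = approximate(point, bits_per_dim=2)
print(cell, to_bit_string(cell, 2), cell_box(cell, 2))
```

With l bits per dimension, a point is stored in d * l bits instead of d full-precision coordinates, which is where the compression comes from.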

Example:

Two probabilities associated with VA-Files

Filtering Step

Simple Search Algorithm

• An array of k elements is maintained.

• This array is kept in sorted order.

• The file is scanned sequentially.

• If an element's lower bound is less than the distance of the k-th element in the array (the current upper bound on the k-NN distance), its actual distance is calculated.
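The simple search strategy can be sketched as follows. This is an illustrative reading of the steps on this slide, not the presenters' code: it assumes equal-width grid cells over [0, 1]^d, the Euclidean metric, and an in-memory list standing in for the approximation file. All function names are hypothetical.

```python
import bisect
import math


def cell_of(vector, bits=4):
    """Lower and upper corner of the grid cell containing the vector."""
    slices = 1 << bits
    idx = [min(int(x * slices), slices - 1) for x in vector]
    return [i / slices for i in idx], [(i + 1) / slices for i in idx]


def lower_bound(query, lo, hi):
    """Smallest possible distance from the query to any point in the cell."""
    total = 0.0
    for q, a, b in zip(query, lo, hi):
        gap = max(a - q, q - b, 0.0)
        total += gap * gap
    return math.sqrt(total)


def simple_search(vectors, boxes, query, k):
    """Scan all approximations; compute real distances only when the bound allows."""
    best = []      # sorted list of (distance, index), at most k entries
    visited = 0    # number of vectors whose actual distance was computed
    for i, (lo, hi) in enumerate(boxes):
        kth = best[-1][0] if len(best) == k else math.inf
        if lower_bound(query, lo, hi) < kth:
            visited += 1
            bisect.insort(best, (math.dist(vectors[i], query), i))
            del best[k:]   # keep only the k closest candidates
    return best, visited
```

A call such as `simple_search(vectors, [cell_of(v) for v in vectors], query, k=5)` returns the five nearest vectors together with the number of exact distance computations, which is usually far smaller than the number of vectors.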

Filtering Step

Near-Optimal Search Algorithm

• Done in two steps.

• Step 1: While scanning through the file, keep track of the k-th smallest upper bound encountered so far.

• If a new element has a lower bound greater than this value, discard it.

Filtering Step

• Step 2: The elements remaining after Step 1 are collected.

• They are visited in increasing order of lower bound until a lower bound is greater than or equal to the actual distance of the current k-th nearest element.
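The two steps can be sketched as below. As before, this is an illustrative reading of the slides under stated assumptions (equal-width grid cells over [0, 1]^d, Euclidean metric, in-memory lists), and every name in it is hypothetical.

```python
import bisect
import heapq
import math


def cell_of(vector, bits=4):
    """Lower and upper corner of the grid cell containing the vector."""
    slices = 1 << bits
    idx = [min(int(x * slices), slices - 1) for x in vector]
    return [i / slices for i in idx], [(i + 1) / slices for i in idx]


def bounds(query, lo, hi):
    """Lower and upper bound on the distance from the query to the cell."""
    low_sq = up_sq = 0.0
    for q, a, b in zip(query, lo, hi):
        near = max(a - q, q - b, 0.0)
        far = max(abs(q - a), abs(q - b))
        low_sq += near * near
        up_sq += far * far
    return math.sqrt(low_sq), math.sqrt(up_sq)


def near_optimal_search(vectors, boxes, query, k):
    # Step 1: filter using approximations only.
    smallest_uppers = []   # negated values: a max-heap of the k smallest upper bounds
    candidates = []
    for i, (lo, hi) in enumerate(boxes):
        lower, upper = bounds(query, lo, hi)
        delta = -smallest_uppers[0] if len(smallest_uppers) == k else math.inf
        if lower <= delta:
            candidates.append((lower, i))
            if len(smallest_uppers) < k:
                heapq.heappush(smallest_uppers, -upper)
            elif upper < delta:
                heapq.heapreplace(smallest_uppers, -upper)
    # Step 2: visit candidates in increasing order of lower bound.
    candidates.sort()
    best, visited = [], 0
    for lower, i in candidates:
        if len(best) == k and lower >= best[-1][0]:
            break   # no remaining candidate can be closer than the current k-th
        visited += 1
        bisect.insort(best, (math.dist(vectors[i], query), i))
        del best[k:]
    return best, visited
```

Because Step 2 processes candidates by ascending lower bound and stops at the first one that cannot improve the result, the number of exact distance computations is typically lower than in a single-pass scan that computes distances as it goes.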

Performance

[Figure placeholder: two performance graphs]

Performance

Conclusion

All approaches to nearest-neighbor search in HDVSs ultimately become linear at high dimensionality.

The VA-File method can outperform any other method known to the authors.