
Preserving-Ignoring Transformation based Index for Approximate k Nearest Neighbor Search

Gang Hu, Jie Shao, Dongxiang Zhang, Yang Yang, Heng Tao Shen
School of Computer Science and Engineering
University of Electronic Science and Technology of China
[email protected], {shaojie,zhangdo,yang.yang,shenhengtao}@uestc.edu.cn

Abstract—Locality sensitive hashing (LSH) and its variants are widely used for approximate kNN (k nearest neighbor) search in high-dimensional space. The success of these techniques largely depends on the ability of preserving kNN information. Unfortunately, LSH only provides a high probability that nearby points in the original space are projected into a nearby region in a new space. This can produce many false positives and false negatives caused by unrelated points. Many extensions of LSH aim to alleviate this issue by improving the distance preserving ability. In this paper, we abandon improving the LSH function itself and instead propose a novel idea to enhance performance by transforming the original data to a new space before applying LSH. A preserving-ignoring transformation (PIT) function satisfying some rigorous conditions can be used to convert original points to an interim space with a strict distance preserving-ignoring capacity. Based on this property, a linear order is utilized to build an efficient index structure in the interim space. Finally, LSH can be applied to the candidate set retrieved by our index structure to obtain the final results. Experiments show that the proposed approach performs better than the state-of-the-art methods SK-LSH, DSH and NSH in terms of both accuracy and efficiency.

I. INTRODUCTION

For many database applications, k nearest neighbor (kNN) search is fundamental, for example similarity search in multimedia databases. To bypass the prohibitive cost of finding exact answers in high-dimensional space, approximate nearest neighbor (ANN) search, which finds points close enough to a query point instead of the closest one, has become popular, and locality sensitive hashing (LSH) [1] is the most promising solution. Usually, a hash function projects a point to a numerical value. Formally, such hash functions can be defined as $h: D^d \rightarrow Z$, where $D^d$ represents the domain of the original space and $Z$ represents the domain of integers. By the distance preserving property of LSH, nearby points in the original space are projected into a nearby region in a new space with high probability. To relieve the effect of false positives and false negatives, b randomly chosen LSH functions are used together to generate a compound hash key with b elements for each point, distinguishing points from each other. Many variants of LSH have been developed, such as multi-probe LSH [2], [3], C2LSH [4] and LSB [5]. Binary hashing such as [6] is also a hot research topic in computer vision.
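As an illustration of how such a compound key can be formed, the following minimal sketch uses the p-stable random projection scheme $h(p) = \lfloor (a \cdot p + c)/w \rfloor$, which is one common LSH family; the paper does not fix a particular family, and the class name and parameters here are our own choices.

```python
import numpy as np

# Minimal sketch of a compound LSH key: b random-projection hash
# functions whose integer outputs are concatenated into one key.
# CompoundLSH, w, b and seed are illustrative, not from the paper.
class CompoundLSH:
    def __init__(self, dim, b=16, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal((b, dim))  # one projection vector per hash
        self.c = rng.uniform(0.0, w, size=b)    # random offsets in [0, w)
        self.w = w                              # bucket width

    def key(self, p):
        # Compound key: tuple of b integer hash values for point p.
        return tuple(np.floor((self.a @ p + self.c) / self.w).astype(int))
```

For example, `CompoundLSH(dim=32).key(x)` maps a 32-dimensional point `x` to a 16-element integer key; nearby points collide on many elements with high probability.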

The essence of LSH is that database points as well as the query point are converted into a new space named the hash space, which usually is of low dimensionality. According to the property of distance preserving, we can easily obtain r candidate points in the hash space. After a refinement process, the top-k nearest points are returned as the approximate answer. However, a drawback of the LSH scheme is that it only provides a high probability that nearby points in the original space are projected into a nearby region in the new space. Thus, false positives and false negatives are inevitable. Although using more hash functions helps to relieve false positives, a larger cost is involved and false positives still exist.

The index-based approach [7] aims to reduce the number of pages accessed during search by building an index in the hash space. However, a large number of candidate points still need their original distances computed for the final results, and the performance deteriorates with a large number of queries and limited disk page accesses. Recently, neighbor-sensitive hashing (NSH) [8] provides a novel approach to reduce the number of points sharing the same hash code by enlarging the distance between nearby points in the hash space, in order to reduce the cost of refinement. For large-scale data sets with a limited number of hash codes, NSH may not perform well. Moreover, NSH has no mechanism to relieve the effect of false positives and false negatives. Statistical optimization techniques have also been investigated for learning-based or data-dependent hashing, which take the distribution of data into account in order to obtain more effective hash functions. Notable approaches in this direction include spectral hashing (SH) [9], anchor graph hashing (AGH) [10], spherical hashing (SpH) [6], compressed hashing (CH) [11], complementary projection hashing (CPH) [12] and data sensitive hashing (DSH) [13]. DSH keeps kNN items together using adaptive boosting and spectral techniques. Equipped with the distribution knowledge, DSH solves the key challenge of generating the data sensitive hashing family. However, DSH is sensitive to the data distribution, and it cannot relieve the effect of false positives and false negatives either.

Our idea: Most existing LSH schemes only provide a high probability that nearby points in the original space are projected into a nearby region in a new space. Meanwhile, an index provides a good way to filter out unrelated points. If there are transforming functions that ensure the distance after transformation holds a strict relationship with the original distance, we can use them as an intermediate step to transform the original data to a new space and build an index based on this strict distance relationship. Then, candidate points are searched with the index structure, and LSH can be applied to the candidate points in the intermediate space to obtain the final results. In this way, false positives and false negatives are relieved and the search time becomes small.

We illustrate our overall idea in Figure 1.


Fig. 1. Our idea.

A multiple preserving-ignoring transformation (PIT) function is applied to the original data to convert it from d-dimensional to m-dimensional. The transforming function must ensure that nearby points in the original space locate in a nearby region in the new space with a strict order, so the index built in the transformed space can easily filter out a large number of unrelated points. Finally, LSH is applied to convert the transformed candidate points from the m-dimensional space to a b-dimensional hash space. Since the transforming function preserves a strict distance order, the index filters out many unrelated points according to this order. Thus, false positives and false negatives can be relieved to a large extent.

The rest of the paper is organized as follows. We introduce the preserving-ignoring transformation and its transformed space in Section II. In Section III, we introduce a linear order and our PIT based index for approximate kNN search. Experimental results are shown in Section IV.

II. PRESERVING-IGNORING TRANSFORMATION

We first present the basic idea of preserving-ignoring transformation (PIT), followed by a particular and concrete example. To apply the proposed PIT for kNN search of an arbitrary query, we further investigate a multiple preserving-ignoring transformation mechanism.

A. PIT Function

Given a query point for kNN search, what is of user interest are its top-k closest points; points whose distances are beyond the top-k distance threshold are not important. This motivates us to seek a kind of function that maps the original data to a new metric space which ideally preserves the neighborhood relationship of the top-k nearby points and at the same time ignores those points far away. A transformation with this capacity is regarded as having the "preserving-ignoring" property. For preserving the neighborhood relationship, we can assign weights to nearby points according to their distances to a center point (a smaller distance corresponds to a larger weight). For ignoring far-away distances, we can assign such points a weight close to a constant, such as zero. In this way, points with larger weights can be selected as the final kNN items.

To formally introduce the concept of preserving-ignoring transformation, we first define the distance in the original space.

Definition 1 (Original Distance): Let $P_i$ and $P_j$ be two points of d dimensionality in the original metric space. We define $dist_O(i,j) = \|P_i - P_j\|$ as the original distance between $P_i$ and $P_j$.

We then define a preserving-ignoring transformation function, which maps data points into a new space.

Definition 2 (Transforming Function): Let $PIT_i()$ be a transforming function with respect to a point $P_i$; it maps another point $P_j$ in the original space into a new space, with a 1-dimensional weight score represented as $PIT_i(P_j)$.

Based on the above definition, we can see that the transforming function is related to a point $P_i$, which is regarded as the transforming center.

We finally define a new kind of distance in the transformed space to represent the weight described above.

Definition 3 (Transformed Distance): Let $P_j$ and $P_k$ be two points of d dimensionality in the original space, $PIT_i()$ be a transforming function with respect to a point $P_i$, and $PIT_i(P_j)$ and $PIT_i(P_k)$ be the two corresponding points in the transformed space. We define $dist_{T_i}(j,k) = \|PIT_i(P_j) - PIT_i(P_k)\|$ as the transformed distance between $P_j$ and $P_k$ with respect to $P_i$.

A transforming function having both capacities of preserving the neighborhood relationship for nearby points and ignoring the distances of far-away points should satisfy the following requirements:

• Let $P_i$, $P_j$ and $P_k$ be three points in the original space. Assume $P_i$ is the transforming center and $PIT_i()$ is a transforming function with respect to $P_i$; $PIT_i(P_j)$ and $PIT_i(P_k)$ are the two points in the transformed space corresponding to $P_j$ and $P_k$. If $P_j$ is within the top-k closest points of $P_i$ in the original space and $P_k$ is outside the top-k closest points, then $PIT_i(P_j)$ has a score kept away from zero and $PIT_i(P_k)$ has a score close to zero.

• For $P_i$, $P_j$ and $P_k$, if $P_j$ and $P_k$ are both within the top-k closest points of $P_i$, then for preserving the neighborhood relationship the following conditions should hold:
  * If $\|P_i - P_j\| - \|P_i - P_k\| \geq 0$, then $PIT_i(P_j) - PIT_i(P_k) \leq 0$.
  * If $\|P_i - P_j\| - \|P_i - P_k\| < 0$, then $PIT_i(P_j) - PIT_i(P_k) > 0$.

• For $P_i$, $P_j$ and $P_k$, if $P_j$ is within the top-k closest points of $P_i$ and $P_k$ is outside the top-k closest points of $P_i$, then the following conditions should hold:
  * $\|P_i - P_j\| - \|P_i - P_k\| < 0$, and $PIT_i(P_j) - PIT_i(P_k) > 0$.

• For $P_i$, $P_j$ and $P_k$, if $P_j$ and $P_k$ are both outside the top-k closest points of $P_i$, then for ignoring far-away distances the following conditions should hold:
  * If $\|P_i - P_j\| - \|P_i - P_k\| \geq 0$, then $PIT_i(P_j) - PIT_i(P_k) \to 0$.
  * If $\|P_i - P_j\| - \|P_i - P_k\| < 0$, then $PIT_i(P_j) - PIT_i(P_k) \to 0$.

A particular transforming function can be given as

$$T_q(P_i) = \exp\left(-\left(\frac{dist_O(q, P_i)}{\gamma}\right)^{2}\right) \qquad (1)$$

where $q$ is a query, $P_i$ is an arbitrary point in the original data space, $dist_O(q, P_i)$ is the original distance between $q$ and $P_i$, and $\gamma$ is a parameter determined by the number of requested results k for kNN search. Note that functions of this kind are abundant; we only present one of them here.
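For concreteness, the following minimal sketch implements Equation (1) in Python; the function and variable names are our own, and how $\gamma$ is derived from k is left to the user.

```python
import numpy as np

# Sketch of the PIT function in Equation (1). For points much closer to
# q than gamma the score stays near 1 and decreases strictly with
# distance; far beyond gamma it collapses toward 0, which is exactly
# the "preserving-ignoring" behavior described above.
def pit(q, p, gamma):
    dist_o = np.linalg.norm(q - p)          # original distance dist_O(q, P_i)
    return np.exp(-(dist_o / gamma) ** 2)   # T_q(P_i)
```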


Fig. 2. Visual demonstration of acceptable queries: (a) acceptable queries measured by points; (b) acceptable queries measured by distance.

Fig. 3. Multiple preserving-ignoring transformation mechanism.

To avoid the restriction that the query point must be the transforming center, a solution is to make the distance between the transforming center point p and the query point q small. Considering Equation 1, it can be observed that the function $T_p(q)$ is monotonically decreasing with the increase of $dist_O(p, q)$ for an arbitrary q. The proximity of p and q in the transformed space is determined by the ratio of $dist_O(p, q)$ to the value of $\gamma$. When $dist_O(p, q) < \lambda\gamma$, $T_p(q)$ is above $\exp(-\lambda^2)$. This means that $T_p(q)$ asymptotically equals $T_p(p)$, i.e., the query point q approximately coincides with the transforming center p. Therefore, for a query q and a transforming center point p, if $dist_O(p, q) < \lambda\gamma$, the "preserving-ignoring" property holds for $T_p(P_i)$. Meanwhile, we name the query region satisfying $dist_O(p, q) < \lambda\gamma$ the "acceptable region". Figure 2 gives a visual demonstration of the acceptable queries for a center point, measured by points and by distance.
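The acceptable-region test can be written directly from the condition above; `lam` stands for $\lambda$ and, like the helper itself, is our illustrative sketch rather than code from the paper.

```python
import numpy as np

# Acceptable-region test: query q may reuse center p's transformation
# when dist_O(p, q) < lambda * gamma, which guarantees
# T_p(q) > exp(-lambda^2).
def in_acceptable_region(p, q, gamma, lam):
    return np.linalg.norm(p - q) < lam * gamma
```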

B. Multiple Preserving-Ignoring Transformation

A transforming function using only a single transforming center does not work for the kNN search problem, since an arbitrary query can be given and it is not practical to select a suitable center point and map the data again. Therefore, we propose a multiple preserving-ignoring transformation mechanism, which simultaneously uses multiple transforming centers to transform data points to a new space.

Figure 3 gives a visual demonstration of the multiple preserving-ignoring transformation mechanism. $\{P_i\}$ is a set of transforming centers and $\{T_{P_i}(P_x)\}$ denotes the corresponding transforming functions. As Figure 3 shows, each transforming center affects those points belonging to its "acceptable region" and ignores distant points. To apply multiple preserving-ignoring transformation in our kNN search algorithm, a compound PIT function is employed.

Definition 4 (Compound PIT Function): For a set of center points $\{p_1, \ldots, p_m\}$, let $\{PIT_{p_1}(), \ldots, PIT_{p_m}()\}$ be the set of corresponding transforming functions. A compound PIT function $\Gamma_{PIT}$ is formed by:

$$\Gamma_{PIT}(x) = (PIT_{p_1}(x), \ldots, PIT_{p_m}(x)) \qquad (2)$$

Applying multiple preserving-ignoring transformation also changes the dimensionality, which affects the later hashing step. Thus, we assign the number of centers as $m = \mu b$, where $\mu$ is a positive integer and $b$ is the number of hash functions. In our work, the compound PIT function $\Gamma_{PIT}$ is:

$$\Gamma_T(x) = (T_{p_1}(x), \ldots, T_{p_m}(x)) \qquad (3)$$

For an arbitrary point $x$, the compound PIT value is $(T_{p_1}(x), \ldots, T_{p_m}(x))$.
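A minimal sketch of Equation (3), assuming the Gaussian-shaped PIT of Equation (1) and an (m, d) array of center points; the names are ours:

```python
import numpy as np

# Compound PIT of Equation (3): map a d-dimensional point x to an
# m-dimensional vector of T_{p_i}(x) scores, one per transforming center.
def compound_pit(x, centers, gamma):
    dists = np.linalg.norm(centers - x, axis=1)  # dist_O(p_i, x), shape (m,)
    return np.exp(-(dists / gamma) ** 2)         # Gamma_T(x), shape (m,)
```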

III. SEARCH WITH LINEAR ORDER AND INDEX

Based on the proposed transformation, we then define a linear order to re-rank all data in the transformed space, which is of great importance for building an index for approximate kNN search.

A. Linear Order

Before introducing the linear order, we first need to define a linear order measurement to describe the compound PIT value. We denote the linear order measurement of the compound PIT value $\Gamma_{PIT}(x)$ as $LM(\Gamma_{PIT}(x))$, and define it as follows:

Definition 5 (Linear Order Measurement): Let $x$ be an arbitrary point and $\Gamma_{PIT}(x)$ be its compound PIT value. The linear order measurement $LM(\Gamma_{PIT}(x))$ of $x$ is defined as follows:

$$LM(\Gamma_{PIT}(x)) = i, \quad \text{where } i = \operatorname*{arg\,max}_{j = 1, \ldots, m} PIT_{p_j}(x) \qquad (4)$$

Intuitively, $LM(\Gamma_{PIT}(x))$ is the dimension at which the compound PIT value $\Gamma_{PIT}(x)$ attains its maximum. We use an example to demonstrate how to compute it. If $m = 5$ and $\Gamma_{PIT}(x) = (0.6, 0.9, 0.5, 0, 0)$, the maximum component is the second one, $PIT_{p_2}(x)$, so according to Definition 5, $LM(\Gamma_{PIT}(x)) = 2$.
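The linear order measurement of Equation (4) thus reduces to an argmax over the compound PIT vector, as in this sketch (1-based indexing to match the example above):

```python
import numpy as np

# LM(Gamma_PIT(x)): index of the largest component of the compound PIT
# value, using 1-based indexing as in the paper's example.
def lm(compound_pit_value):
    return int(np.argmax(compound_pit_value)) + 1

print(lm(np.array([0.6, 0.9, 0.5, 0.0, 0.0])))  # -> 2
```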

Next, we introduce the relation between compound PIT values and linear order measurements. For a data set $D$ and a compound PIT function $\Gamma_{PIT}(x)$, we define an algebra system $\mathcal{S} = \{\Gamma_{PIT}(x) \mid x \in D\}$ and a binary relation $\preceq$ on compound PIT values. It can be proved that $\preceq$ is a linear order on $\mathcal{S}$.

B. Search Process

Now, LSH can be applied to points in the transformed space. Compared with applying LSH to points in the original space, the false positives and false negatives are decreased. In our approach, we apply the compound LSH function $\Gamma_H$ to the points transformed by the proposed transforming function. First, a d-dimensional point in the original space is transformed to m dimensions by our compound PIT function $\Gamma_{PIT}$. Then, the m-dimensional point is transformed to b dimensions by the compound LSH function $\Gamma_H$.
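Under our reading, this two-step mapping can be sketched by chaining the helpers above; `lsh` is a compound LSH built for m-dimensional inputs (e.g., the CompoundLSH sketch from Section I), and all names here are our assumptions rather than the authors' API.

```python
# d-dimensional original point -> m-dimensional interim (PIT) vector
# -> b-element compound hash key. Reuses compound_pit() and CompoundLSH
# sketched earlier; lsh must be constructed with dim equal to m.
def transform_and_hash(x, centers, gamma, lsh):
    z = compound_pit(x, centers, gamma)  # d -> m via Gamma_PIT
    return lsh.key(z)                    # m -> b via Gamma_H
```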

In addition, we introduce an index structure based on the preserving-ignoring transformation. The linear order is an important factor in ensuring the accuracy and efficiency of approximate kNN search. We compute the linear order measurement $LM(\Gamma_T(P_{c_i}))$ of the transforming centers by Definition 5. Meanwhile, we apply our compound transforming function to all points. Finally, we store the original data and their corresponding compound PIT values under the corresponding linear order measurement $LM(\Gamma_T(P_{c_i}))$. In this way, we avoid storing all points sorted by their compound PIT values, which would be much more expensive to compute. To sum up, given a data set $D$ and a compound PIT function $\Gamma_{PIT}$, an index file can be constructed with two parts: (i) the corresponding compound PIT values; (ii) the original data points and the index keys to the corresponding transforming centers.
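A minimal sketch of this index build under our reading of the paper follows; points are bucketed by their linear order measurement, i.e., by their highest-scoring transforming center, and the structure and names are our assumptions, not the authors' exact layout.

```python
from collections import defaultdict

# Bucket every point by LM(Gamma_T(x)), storing the point id together
# with its compound PIT value so candidates can later be refined by LSH
# in the interim space. Reuses compound_pit() and lm() sketched earlier.
def build_index(data, centers, gamma):
    index = defaultdict(list)
    for pid, x in enumerate(data):
        z = compound_pit(x, centers, gamma)   # compound PIT value of x
        index[lm(z)].append((pid, z))         # bucket by linear order measurement
    return index
```

At query time, candidate points can then be drawn from the buckets of the centers closest to the query before the LSH-based refinement described above.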

IV. EXPERIMENTS

Our experiments are conducted on four real-life data sets: Corel features¹, which contains 68,040 32-dimensional color histograms; Forest², which contains 581,012 54-dimensional data points; Audio³, which contains 54,387 192-dimensional vectors; and Sift⁴, which contains 1 million base vectors and 10,000 query vectors, each of 128 dimensions. To demonstrate the superiority of our PIT based index approach, we compare it with three state-of-the-art methods proposed recently: SK-LSH [7], DSH [13] and NSH [8]. Due to the page limit, we only show the results on the Corel and Sift data sets, as similar patterns exist for the Forest and Audio data sets.

Evaluation on query quality: Figure 4 shows the effect of the length of compound hash functions b on recall. Our PIT based index approach performs better than the other state-of-the-art methods in average recall when b exceeds 32 on all data sets. The second best performance is achieved by NSH.

Fig. 4. Effect of the length of compound hash functions b on recall: (a) Corel; (b) Sift.

Evaluation on query time: To better demonstrate and compare the average response time, we plot the time reduction, which shows the ratio of time saved by the PIT based index approach compared with each competing method. The result is shown in Figure 5, with varying length of compound hash functions b.

Fig. 5. Effect of the length of compound hash functions b on time reduction (× 100): (a) Corel; (b) Sift.

ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (grants No. 61672133, No. 61602087, No. 61572108 and No. 61632007), and the Fundamental Research Funds for the Central Universities (grants No. ZYGX2015J058, No. ZYGX2015J055 and No. ZYGX2014Z007).

¹ http://kdd.ics.uci.edu/databases/CorelFeatures/
² http://archive.ics.uci.edu/ml/datasets/Covertype
³ http://www.cs.princeton.edu/cass/audio.tar.gz
⁴ http://corpus-texmex.irisa.fr/


REFERENCES

[1] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26, 1998, pp. 604–613.

[2] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," in Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, pp. 950–961.

[3] A. Joly and O. Buisson, "A posteriori multi-probe locality sensitive hashing," in Proceedings of the 16th International Conference on Multimedia, Vancouver, British Columbia, Canada, October 26-31, 2008, pp. 209–218.

[4] J. Gan, J. Feng, Q. Fang, and W. Ng, "Locality-sensitive hashing scheme based on dynamic collision counting," in Proceedings of the ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, May 20-24, 2012, pp. 541–552.

[5] Y. Tao, K. Yi, C. Sheng, and P. Kalnis, "Quality and efficiency in high dimensional nearest neighbor search," in Proceedings of the ACM SIGMOD International Conference on Management of Data, Providence, Rhode Island, USA, June 29 - July 2, 2009, pp. 563–576.

[6] J. Heo, Y. Lee, J. He, S. Chang, and S. Yoon, "Spherical hashing," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pp. 2957–2964.

[7] Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen, "SK-LSH: An efficient index structure for approximate nearest neighbor search," PVLDB, vol. 7, no. 9, pp. 745–756, 2014.

[8] Y. Park, M. J. Cafarella, and B. Mozafari, "Neighbor-sensitive hashing," PVLDB, vol. 9, no. 3, pp. 144–155, 2015.

[9] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Advances in Neural Information Processing Systems 21, Vancouver, British Columbia, Canada, December 8-11, 2008, pp. 1753–1760.

[10] W. Liu, J. Wang, S. Kumar, and S. Chang, "Hashing with graphs," in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp. 1–8.

[11] Y. Lin, R. Jin, D. Cai, S. Yan, and X. Li, "Compressed hashing," in 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pp. 446–451.

[12] Z. Jin, Y. Hu, Y. Lin, D. Zhang, S. Lin, D. Cai, and X. Li, "Complementary projection hashing," in IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pp. 257–264.

[13] J. Gao, H. V. Jagadish, W. Lu, and B. C. Ooi, "DSH: Data sensitive hashing for high-dimensional k-NN search," in Proceedings of the ACM SIGMOD International Conference on Management of Data, Snowbird, UT, USA, June 22-27, 2014, pp. 1127–1138.
