spatial joins based on na-trees

6
Information Processing Letters 109 (2009) 713–718 Contents lists available at ScienceDirect Information Processing Letters www.elsevier.com/locate/ipl Spatial joins based on NA-trees Hue-Ling Chen , Ye-In Chang Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan, ROC article info Article history: Received 31 October 2007 Available online 20 March 2009 Communicated by F. Dehne Keywords: Design of algorithms NA-tree R-tree Spatial index Spatial join 1. Introduction Spatial join is one important spatial query in a spatial database system which combines two spatial datasets to retrieve the matched pair of objects based on the spatial predicate. The spatial predicate specifies the geometrical relationship between objects, and the most common one is the overlap of spatial objects. In a GIS system, the spa- tial join may be used for an efficient implementation of the map overlay. The map overlay constructs a new map from two or more given maps which is important for geographic analysis. Application areas contain city planning, ecologi- cal, demographic studies, and so on [1,5,6]. However, there is no total ordering of spatial objects that preserves spa- tial proximity. It is complicated to represent the attributes and relations of spatial objects, even to evaluate the com- plexity of spatial queries. Since the spatial join constructs the large volume of spatial relations, to evaluate the spa- tial predicate between objects is time-consuming. Since This research was supported in part by the National Science Council of Republic of China under Grant No. NSC-93-2213-E-110-027. The authors like to thank National Sun Yat-Sen University, “Aim for Top University Plan” project of NSYSU, Ministry of Education, Taiwan, for partially sup- porting the research. * Corresponding author. E-mail address: [email protected] (H.-L. Chen). a spatial index is an accelerator for the spatial query to search an object quickly [1,5], we expect to take advantage of the spatial index to perform the spatial join efficiently. Recently, an Nine-Areas tree (denoted NA-tree) has been proposed to minimize the number of disk accesses during a tree search for spatial queries [3]. It solves the problem of overlaps between MBR’s of internal nodes in the R -tree and accesses less number of nodes than the R -tree [2,4]. Therefore, we present an algorithm for the spatial join based on NA-trees, i.e., the NA-tree join. Although several studies have been made on solving the overlapping prob- lem of the R -tree, e.g., R -tree [5], the R -tree join will not be better than the R -tree join. Because the spatial join aims at examining overlaps of objects quickly, the R -tree join has to make a larger number of nodes comparisons than the R -tree join. For the other techniques which are used to improve the performance of the R -tree join in [5], they also can be helpful to our NA-tree join. Therefore, we compare the performance of the original R -tree join with our NA-tree join. We analyze all overlaps between regions of two nodes in two NA-trees and obtain one cor- relation table which records the pair of overlapped nodes. Our NA-tree join can execute efficiently by looking up this correlation table, instead of nodes comparisons. Our NA- tree join can access a smaller number of nodes directly based on the correlation table than the R -tree join. 0020-0190/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.ipl.2009.03.015

Upload: hue-ling-chen

Post on 26-Jun-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Spatial joins based on NA-trees

Information Processing Letters 109 (2009) 713–718

Contents lists available at ScienceDirect

Information Processing Letters

www.elsevier.com/locate/ipl

Spatial joins based on NA-trees ✩

Hue-Ling Chen ∗, Ye-In Chang

Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan, ROC

a r t i c l e i n f o

Article history:Received 31 October 2007Available online 20 March 2009Communicated by F. Dehne

Keywords:Design of algorithmsNA-treeR-treeSpatial indexSpatial join

1. Introduction

Spatial join is one important spatial query in a spatialdatabase system which combines two spatial datasets toretrieve the matched pair of objects based on the spatialpredicate. The spatial predicate specifies the geometricalrelationship between objects, and the most common oneis the overlap of spatial objects. In a GIS system, the spa-tial join may be used for an efficient implementation of themap overlay. The map overlay constructs a new map fromtwo or more given maps which is important for geographicanalysis. Application areas contain city planning, ecologi-cal, demographic studies, and so on [1,5,6]. However, thereis no total ordering of spatial objects that preserves spa-tial proximity. It is complicated to represent the attributesand relations of spatial objects, even to evaluate the com-plexity of spatial queries. Since the spatial join constructsthe large volume of spatial relations, to evaluate the spa-tial predicate between objects is time-consuming. Since

✩ This research was supported in part by the National Science Councilof Republic of China under Grant No. NSC-93-2213-E-110-027. The authorslike to thank National Sun Yat-Sen University, “Aim for Top UniversityPlan” project of NSYSU, Ministry of Education, Taiwan, for partially sup-porting the research.

* Corresponding author.E-mail address: [email protected] (H.-L. Chen).

0020-0190/$ – see front matter © 2009 Elsevier B.V. All rights reserved.doi:10.1016/j.ipl.2009.03.015

a spatial index is an accelerator for the spatial query tosearch an object quickly [1,5], we expect to take advantageof the spatial index to perform the spatial join efficiently.Recently, an Nine-Areas tree (denoted NA-tree) has beenproposed to minimize the number of disk accesses duringa tree search for spatial queries [3]. It solves the problemof overlaps between MBR’s of internal nodes in the R-treeand accesses less number of nodes than the R-tree [2,4].Therefore, we present an algorithm for the spatial joinbased on NA-trees, i.e., the NA-tree join. Although severalstudies have been made on solving the overlapping prob-lem of the R-tree, e.g., R∗-tree [5], the R∗-tree join willnot be better than the R-tree join. Because the spatial joinaims at examining overlaps of objects quickly, the R∗-treejoin has to make a larger number of nodes comparisonsthan the R-tree join. For the other techniques which areused to improve the performance of the R-tree join in [5],they also can be helpful to our NA-tree join. Therefore,we compare the performance of the original R-tree joinwith our NA-tree join. We analyze all overlaps betweenregions of two nodes in two NA-trees and obtain one cor-relation table which records the pair of overlapped nodes.Our NA-tree join can execute efficiently by looking up thiscorrelation table, instead of nodes comparisons. Our NA-tree join can access a smaller number of nodes directlybased on the correlation table than the R-tree join.

Page 2: Spatial joins based on NA-trees

714 H.-L. Chen, Y.-I. Chang / Information Processing Letters 109 (2009) 713–718

Fig. 1. An example of the bucket numbering scheme, O (l, u) = (9,14).

2. An NA-tree

In an NA-tree [3], a spatial object is specified byits bounding rectangle and represented by two points,L(Xl, Yb) and U (Xr, Yt), where L is the lower left coordi-nate and U is the upper right coordinate of the boundingrectangle. Based on the bucket-numbering scheme [3], thespatial number O (l, u) is calculated for the spatial ob-ject O , where l is the bucket number of L(Xl, Yb) andu is the bucket number of U (Xr, Yt). For example, inFig. 1, the spatial number of object O is (9,14). A vari-able, Max_bucket, is used to record the maximum bucketnumber (in decimal form) of this area. In Fig. 1, the maxi-mum bucket number is 15 (0111), i.e., Max_bucket = 15.

An NA-tree is a structure based on data location andorganized by the spatial numbers. First, the whole spatialregion is decomposed into four regions with correspondinglimitations of bucket numbers as shown in Fig. 2(a) [3].Then, for an object O (l, u), it may be stored as one ofthe following children of an internal node p in an NA-treestructure, as shown in Fig. 2(b). For example, when l ∈ re-gion I and u ∈ region II, O is the 5th_child of node p.However, based on different types of children, an internalnode may have different numbers of children. For exam-ple, when it is the 5th_child or the 7th_child, it has onlythree children: 5th_child, 7th_child, and 9th_child. Finally,an object can only be stored in a leaf node, instead ofan internal node. Fig. 3 shows an example of an NA-treestructure. For example, object D6 in Fig. 3(a) is stored inthe 6th_child node in Fig. 3(b). In [3], an NA-tree can sup-port arbitrary insertions and deletions of objects withoutany global re-organization, depending on the spatial num-bers of objects. It also can efficiently support exact matchqueries and range queries.

Fig. 2. An NA-tree: (a) four regions; (b) the basic structure.

Page 3: Spatial joins based on NA-trees

H.-L. Chen, Y.-I. Chang / Information Processing Letters 109 (2009) 713–718 715

Fig. 3. An example: (a) the data; (b) the corresponding NA-tree structure (bucket_capacity = 2).

Procedure NASJoin(P A , P B );/* N A and NB are child nodes of nodes P A and P B , respectively. *//* O A and O B are objects under the leaf nodes P A and P B , respectively. */begin

if (P A .leaf = false) and (P B .leaf = false) then /* Case 1 */for (all N A of P A ) do

for (all NB of P B ) doif (Correlation(N A .uid, NB .uid) = true) thenbegin ReadNode(N A ); ReadNode(NB ); NASJoin(N A , NB ); endelse continue

else if (P A .leaf = true) and (P B .leaf = true) then /* Case 2 */ObjectsJoin(P A , P B )

else if (P A .leaf �= P B .leaf) then /* Case 3 */begin

if (P A .leaf = true) then for (all O A of P A ) do Range_Query(P B , O A )else if (P B .leaf = true) then for (all O B of P B ) do Range_Query(P A, O B );

end;end;

Fig. 4. Procedure NASJoin.

3. The NA-tree join

Let A and B be two NA-trees which index two datasets.The process of our NA-tree join starts on two root nodes,continues on two internal nodes at each level, and stopson leaf nodes in two NA-trees. An overlap means that theintersection of regions of two nodes is non-empty. Basi-cally, three cases of the height of both input NA-trees willbe considered during our NA-tree join on pairs of nodesN A and NB in NA-trees A and B , respectively. These casesare described as follows:

• Case 1: NA-trees are of the same height and nodes N Aand NB are internal nodes.

• Case 2: NA-trees are of the same height and nodes N Aand NB are leaf nodes.

• Case 3: NA-trees are of different height; that is, onenode is an internal node, and another one is a leafnode.

Procedure NASJoin in Fig. 4 shows the process of ourNA-tree join. In this section, we first describe the definitionof a correlation pair used in function Correlation. Then, wedescribe the process of our NA-tree join on two NA-trees ofthe same height, i.e., Cases 1 and 2. Finally, we describe theprocess of our NA-tree join on two NA-trees of differentheight, i.e., Case 3.

3.1. A correlation pair

Since the spatial join investigates overlaps, a correla-tion pair (N A .uid, NB .uid) is used to describe the overlapwith uid’s of nodes N A and NB , respectively. An uid isused to represent the type of one node Nx based on theNA-tree structure in Fig. 2. Let Nx.uid = 1, 2, 3, 4, 5, 6,7, 8, and 9 represent each of nine child nodes Nx underone internal node at each level. (Note that the uid for theroot is 1.) Because the overlap is examined by relative re-gions of two nodes, we observe the region of the node Nxfirst. From Fig. 5, the region of the node Nx can be in oneof three shapes: Square, Rectangle, and Large Square. Letset S = {1,2,3,4}, set R = {5,6,7,8}, LS = {9} to recordthese uid’s whose corresponding shapes are square, rect-angle, and large square, respectively. From Fig. 5, we canobserve that the region of Nx.uid ∈ R or Nx.uid ∈ L S con-tains subregions which are smaller than it. For example,the region of Nx.uid = 5 ∈ R contains two subregions ofuid’s 1 and 2, respectively.

In our NA-tree join, we use function Correlation(N A .uid,NB .uid) to examine whether the overlap is non-empty ornot. Based on correlation pairs in nine cases of {S, R, LS} ×{S, R, LS}, we analyze all overlaps in table CPairTab, asshown in Fig. 6. The correlation pairs (N A, NB ) in tableCPairTab have non-empty overlaps and function Correlationreturns True. (Note that the value under the correlation

Page 4: Spatial joins based on NA-trees

716 H.-L. Chen, Y.-I. Chang / Information Processing Letters 109 (2009) 713–718

Fig. 5. Three sets of uid’s and regions of nodes with uid’s.

Fig. 6. Table CPairTab: overlaps of correlation pairs (N A .uid, NB .uid).

pair is the uid of the real region.) On the other hand, theempty slots means empty overlaps and function Correla-tion returns False. Therefore, table CPairTab can help func-tion Correlation to return True or False value. The resultof those non-empty overlaps recorded in table CpairTab isdetermined by considering two conditions on N A .uid andNB .uid.

• Condition 1: If N A .uid is equal to NB .uid, the resultingoverlap is N A .uid (or NB .uid). Otherwise, we have toconsider Condition 2.

• Condition 2: Assume that N A .uid is larger than NB .uid,and vice versa, there are three subconditions.(1) If 5 � N A .uid � 8 (N A .uid ∈ R) and 1 � NB .uid � 4

(NB .uid ∈ S), the resulting overlap is NB .uid. Thecorrelation pairs in the light-shaded portion satisfyCondition 2(1).

(2) If 6 � N A .uid � 8 (N A .uid ∈ R) and 5 � NB .uid � 7(NB .uid ∈ R), the resulting overlap is the uid of thecommon subregion. The correlation pairs in thedark-shaded portion satisfy Condition 2(2).

(3) If N A .uid = 9 (N A .uid ∈ L S) and 1 � NB .uid � 8,the resulting overlap is NB .uid.

Let us take Fig. 7 as an example to illustrate Condition 2in table CPairTab. The correlation pair (7,3) satisfies Con-dition 2(1) and the resulting overlap is NB .uid (= 3). The

correlation pair (7,6) satisfies Condition 2(2). Since two re-gions of N A .uid (= 7) and NB .uid (= 6), respectively, havethe non-empty overlap in the common subregion, the re-sulting overlap is the uid of this subregion, i.e., 3. Fromwhat has been discussed above, we can examine the over-lap of the correlation pair (N A .uid, NB .uid) easily by look-ing it up from table CPairTab.

3.2. NA-trees of the same height

Assume that two NA-trees for our NA-tree join are ofthe same height. Let us take Fig. 8 as an example to illus-trate Cases 1 and 2 of procedure NASJoin in Fig. 4. Figs. 8(a)and 8(b) are input datasets with their NA-trees of the sameheight. First, because roots R A and R B have child nodes(Case 1), correlation pairs (1,1), (5,1) and (5,8) are exam-ined to have overlaps by function Correlation. Then, objectsare retrieved sequentially from the disk referenced by leafnodes with their uid’s in these correlation pairs (Case 2).The coordinates of these spatial objects are compared. Fi-nally, the last result contains (A1, B2) and (A2, B2) whichhave non-empty overlaps between them.

3.3. NA-trees of different height

When one NA-tree is higher than another one, objectsmay be stored in the leaf nodes at different height. OurNA-tree join may be executed synchronously on one in-ternal node and one internal node at the same level. It isa condition of Case 3 in Fig. 4. Let us take Fig. 9 as anexample. After the spatial join is executed at level 1, wecan obtain all correlation pairs which have overlaps. Letus take the correlation pair (N A .uid, NB .uid) = (1,1) as anexample to illustrate procedure RangeQuery. The node N Ais a leaf node, while the node NB is an internal node atlevel 1. The regions of N A .uid and NB .uid, respectively, areshown as the thick dotted-lined regions in the right sideof Figs. 9(a) and 9(b), respectively. First, we retrieve ob-ject A2 from node N A as a query rectangle and perform arange query on node NB by procedure RangeQuery. Next,object A2 has overlaps with the region of the existing twochild nodes (uid = 5 and uid = 9) under node NB at thelevel 2. Then, we retrieve objects B4 and B2 from thesetwo nodes, respectively. We compare their positions withthe query rectangle A2. Finally, the pair of objects (A2, B2)is output as the result.

4. Performance

In our simulation, we compare the performance of theR-tree join [2] and our NA-tree join [3]. We consider the

Page 5: Spatial joins based on NA-trees

H.-L. Chen, Y.-I. Chang / Information Processing Letters 109 (2009) 713–718 717

Fig. 7. Overlaps of the correlation pairs (N A .uid, NB .uid) assuming N A .uid > NB .uid.

Fig. 8. An example: (a) the first dataset with NA-tree A; (b) the second dataset with NA-tree B .

Fig. 9. An example of the correlation pair (N A .uid, NB .uid) = (1,1) at the level 1: (a) the first dataset with NA-tree A; (b) the second dataset with NA-tree B .

overlap selectivity (0 � p � 1), which is defined as the num-ber of the data objects with overlap relationship in the to-tal number N of data objects. We assume that the buffersfor join operations are infinite large. We took 1000 aver-ages of results after the spatial join on two datasets withN = 10000 and variable p. Since the spatial join combinestwo datasets by their spatial relationship, we consider theCPU time (the number of nodes comparisons) and the I/Otime (the number of disk accesses) as performance mea-sures [2].

Fig. 10(a) shows the result for the measure of the CPUtime. Our NA-tree join requires a smaller number of nodescomparisons than the R-tree join. The reason is that ourNA-tree join examine overlaps only by looking up the staticnumber of correlation pairs in table CpairTab. However, thenumber of nodes comparisons of the R-tree join increasesas the value of the overlap selectivity p (i.e., the number ofoverlaps) increases. The reason is that the R-tree join hasto compare many related MBR’s of the internal nodes dueto overlaps between them. Fig. 10(b) shows the result forthe measure of the I/O time. Our NA-tree join requires a

smaller number of disk accesses than the R-tree join. Be-cause our NA-tree join produces correlation pairs of nodeswhich have non-empty overlaps, the objects can be directlyaccessed by these candidate nodes. Since the objects arewell ordered in the disk based on the bucket numberingscheme [3], the pairs of objects which have non-emptyoverlaps can be accessed once by their sequential spatialnumbers. However, there is a huge increase in the numberof disk accesses of the R-tree join as the value of p in-creases. Because of no sequential order of these objects inR-trees, the R-tree join has to access objects from the diskmore than once to obtain the actual result.

5. Conclusion

In this paper, we have presented our NA-tree join basedon the correlation table. Our NA-tree join simply uses thecorrelation table to directly obtain candidate leaf nodes ontwo NA-trees which have non-empty overlaps. Moreover,our NA-tree join accesses objects once from those candi-date leaf nodes and returns pairs of objects which havenon-empty overlaps.

Page 6: Spatial joins based on NA-trees

718 H.-L. Chen, Y.-I. Chang / Information Processing Letters 109 (2009) 713–718

Fig. 10. (a) A comparison of the number of nodes comparisons (CPU): N = 10000; (b) A comparison of the number of disk accesses (I/O): N = 10000.

References

[1] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, J. Vahrenhold, J.S. Vitter,A unified approach for indexed and non-indexed spatial joins, in: Proc.of the 7th Int. Conf. on Extending Database Technology: Advances inDatabase Technology, 2000, pp. 413–429.

[2] T. Brinkhoff, H.P. Kriegel, B. Seeger, Efficient processing of spatial joinsusing R-trees, in: Proc. of ACM SIGMOD Int. Conf. on Management ofData, 1993, pp. 237–246.

[3] Y.I. Chang, C.H. Liao, H.L. Chen, NA-trees: A dynamic index for spatialdata, J. Inform. Sci. Eng. 19 (1) (2003) 103–139.

[4] J.P. Dittrich, B. Seeger, Data redundancy and duplicate detection inspatial join processing, in: Proc. of the 16th Int. Conf. on Data Eng.,2000, pp. 535–546.

[5] E.H. Jacox, H. Samet, Spatial join techniques, ACM Trans. on DatabaseSystems 32 (1) (2007) 7–51.

[6] M. Zhu, D. Papadias, J. Zhang, D.L. Lee, Top-k spatial joins, IEEE Trans.on Knowledge and Data Eng. 17 (4) (2005) 567–579.