
  • B. Yu & C.A. Marron, A manuscript for Software Practice & Experience 8/6/03

    Design, Implementation, and Test of An Efficient Processing Technique for Complex Topological Queries

    Byunggu Yu and Cesar A. Marron

    Department of Computer Science PO Box 3315

    University of Wyoming Laramie, WY 82072

    Phone: (307) 766-2440, Fax: (307) 766-4036 Email: [email protected]

    Abstract

    There is an increasing demand for database-supported geographic information systems that are streamlined for handling complex statistical or analytical queries. In practical scientific or analytical geographic database applications, an efficient processing technique for complex topological queries that involve multiple query regions and topological relations is required to ascertain important facts or phenomena in large sets of geographic data. Unfortunately, there is a marked lack of efficient methods for processing complex topological queries. In most geographic information systems, users frequently execute multiple individual operations recursively to collect or analyze data because an adequate query processor for complex spatial queries is lacking. The paper presents the design, implementation, and test of an efficient processing technique for complex topological queries. The first prototype of our advanced GIS, called the relational geographic information system (RGIS), is introduced for the first time as the testbed for the proposed query processing techniques.

    Keywords: spatial databases, geographic information systems, access methods, topological queries, query evaluation

    1. INTRODUCTION

    In this paper, we present the design, implementation, and test of our advanced methods for processing complex topological queries over geographic databases.

    Topological queries constitute one of the most important query types that must be supported by Geographic Information Systems (GIS’s). In GIS’s, complex spatial properties (e.g., absolute positions, relative positions, shapes, and regions) of data objects and spatial relationships between objects must be maintained to support various spatial operations [25]. For example, the operation of selecting every lake whose region is covered by the given region uses the topological relation "covered_by". Only advanced GIS’s support some simple topological queries such as “report every lake in New York” and “report every city that overlaps a given region q”. These simple topological queries involve a query region (in the above examples, “New York” and “q”), a data set (“lake” and “city”), and a topological relation (e.g., inside, contains, overlap). Since geographic data are very complex and large relative to conventional one-dimensional data, the spatial operations (simple spatial queries) on these data are expensive. This cost drives the development of high-performance spatial access methods (SAMs), which continue to attract considerable attention [21, 25]. Recent SAMs efficiently support simple spatial queries.

    In practical scientific or analytical GIS applications, queries are more complex, and there is an increasing demand for database-supported GIS’s that are streamlined for handling complex statistical or analytical queries. Processing complex topological queries that involve multiple query regions and topological relations over existing GIS data is required to efficiently ascertain important facts or phenomena in large sets of geographic data. In Figure 1,¹ for example, the following queries are complex topological queries: "report every road in q3 that overlies the underground gas pipes in q3", "report every landowner who owns land that overlaps q2, and covers any lake in q1", and "report every species whose habitat covers any of the lakes covered by the habitat of the plants that exist in both q1 and q2 but not in q3". In conventional approaches, three operations are required to process the second query: (1) select every piece of land that overlaps the given region q2; (2) select every lake that is contained by the given region q1; (3) join (more precisely, theta join) the results of (1) and (2). Then, the conventional projection operation on the result of (3) generates the final query result. The last query is even more complex.
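    The conventional three-operation plan described above can be sketched in Python under strong simplifying assumptions: regions are replaced by axis-aligned rectangles, the relations `land` and `lake` are small in-memory lists, and the names `Rect`, `overlaps`, `covers`, `contained_by`, and `conventional_plan` are hypothetical helpers, not part of any GIS. The sketch only illustrates that two selections, a theta join, and a projection are evaluated as separate steps.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle standing in for a region (an assumption)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def overlaps(r: Rect, q: Rect) -> bool:
    # Some point of r lies in the interior of q.
    return (r.xmin < q.xmax and q.xmin < r.xmax
            and r.ymin < q.ymax and q.ymin < r.ymax)


def covers(r: Rect, q: Rect) -> bool:
    # Every point of q lies in r (interior or boundary).
    return (r.xmin <= q.xmin and r.ymin <= q.ymin
            and q.xmax <= r.xmax and q.ymax <= r.ymax)


def contained_by(r: Rect, q: Rect) -> bool:
    # Every point of r lies in the interior of q.
    return (q.xmin < r.xmin and q.ymin < r.ymin
            and r.xmax < q.xmax and r.ymax < q.ymax)


def conventional_plan(land, lake, q1, q2):
    # (1) select every piece of land that overlaps q2
    step1 = [l for l in land if overlaps(l["geometry"], q2)]
    # (2) select every lake that is contained by q1
    step2 = [k for k in lake if contained_by(k["geometry"], q1)]
    # (3) theta join: keep (land, lake) pairs where the land covers the lake
    step3 = [(l, k) for l in step1 for k in step2
             if covers(l["geometry"], k["geometry"])]
    # final projection on the landowner attribute (duplicates removed)
    return sorted({l["landowner"] for l, _ in step3})


# Made-up sample data, for illustration only.
land = [{"landowner": "A", "geometry": Rect(0, 0, 10, 10)},
        {"landowner": "B", "geometry": Rect(20, 20, 30, 30)}]
lake = [{"name": "L1", "geometry": Rect(2, 2, 4, 4)},
        {"name": "L2", "geometry": Rect(22, 22, 24, 24)}]
q1, q2 = Rect(1, 1, 9, 9), Rect(5, 5, 15, 15)
print(conventional_plan(land, lake, q1, q2))
```

    Each step materializes an intermediate result before the next one starts, which is the cost that an integrated query processor would try to avoid.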

    Unfortunately, the optimized execution of complex topological queries constitutes an undeveloped field in the research area of geographic information systems [21]. In most GIS’s, users frequently execute multiple individual operations recursively to collect or analyze data because an adequate query processor for complex spatial queries is lacking.

    Figure 1. Query Regions

    The paper presents the design, implementation, and test of an efficient processing technique for

    complex topological queries. The technique can improve the performance and functionality of numerous

    types of spatial database systems including GIS. The rest of the paper is organized as follows. Section 2

    defines simple topological queries in a formal manner and presents an efficient processing technique for

    simple topological queries. Section 3 introduces an efficient processing technique for complex topological

    queries. Section 4 introduces the first working prototype of our advanced GIS called the Relational

    Geographic Information System (RGIS) and our experimental results. Section 5 concludes the paper and

    discusses our future work in this area of research.

    2. TOPOLOGICAL QUERIES

    Since we are considering large-scale GIS’s based on database technology, we deal with geographic queries that can be written in well-established database languages, such as Relational Algebra, Relational Domain Calculus, and Relational Tuple Calculus. That is, the queries can be written in SQL. In this paper, we use Relational Algebra to express query examples. In fact, an increasing number of large-scale GIS’s including Intergraph’s GeoMedia, ESRI’s multi-user ArcGIS based on ArcSDE, and our RGIS (the RGIS is introduced in Section 4) can store geographic data in a relational or object-relational database.

    1 The Wyoming map is provided by WyGISC (Wyoming Geographic Information Science Center) at the University of Wyoming.

    Relational algebra is a set-oriented language that consists of several operations, such as selection (σ), projection (∏), Cartesian product (×), division (÷), and union (∪), to name a few. Relations (i.e., tables constituting a database) are closed under these operations; that is, every operation or combination of these operations produces another relation. A database query consists of some or all of these operations. Some operations, such as σ, may involve a predicate that must be satisfied by all data objects in the output of the operation. For example, $\sigma_{r.A=3}(r)$ produces a subset of the relation r. In the output, all objects (tuples) have a value of 3 on the attribute (property) A. Efficient processing techniques for database queries involving conventional Boolean operators, such as <, >, and =, are well known.
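    As a minimal illustration of the closure property and of a selection with a predicate, the following Python sketch models a relation as a list of dictionaries. The helper names `select` and `project` and the sample relation `r` are assumptions introduced here for illustration only.

```python
def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, attributes):
    """Projection: keep the named attributes and drop duplicate tuples."""
    result = []
    for t in relation:
        row = {a: t[a] for a in attributes}
        if row not in result:
            result.append(row)
    return result


# A made-up relation r with attributes A and B.
r = [{"A": 3, "B": "x"}, {"A": 5, "B": "y"}, {"A": 3, "B": "z"}]

sigma = select(r, lambda t: t["A"] == 3)   # selection with predicate r.A = 3
pi = project(sigma, ["A"])                 # projection applied to that result
print(sigma)
print(pi)
```

    Because both helpers return a list of dictionaries, the output of one operation can be fed directly into another, which is what closure means in practice.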

    Of particular interest for GIS applications are the absolute or relative positions of objects in space and their spatial relationships, and GIS’s must support typical queries involving topological relations,² i.e., the relations between spatial entities that stay invariant under translation, rotation, and scaling of the universe [18]. For example, the query of selecting every object whose region contains a given region uses the topological relation contains in its query predicate. Other meaningful topological relations between spatial objects are: equal, inside, covers, covered_by, overlap, meet, and disjoint [18]. Hence, in this project, we are interested in the topological query predicates involving these topological relations. For example, the query “report the name, the size, and the water volume of every lake that is covered_by the given query region q” can be written as $\Pi_{lake.name,\, lake.size,\, lake.water\_volume}(\sigma_{covered\_by(lake.geometry,\, q)}(lake))$.

    Internally, a database query can be represented by a hierarchy of relational operations (execution

    plan or evaluation tree). Among the nodes (operations) of an evaluation tree, usually, only the leaf-level

    operations access the data relations of the underlying database. The query performance is heavily

    dependent on the performance of σ (and the joins involving σ); and the performance of these operations is dependent on the choice of the underlying access method (e.g., B+-tree). Especially for the operations

    involving topological relations, spatial access methods (SAMs) have been developed. While certain SAMs

    support more accurate representations of the regions [8, 14, 15], most spatial access methods employ

    some form of approximation of spatial objects in order to reduce the size of the index structure and

    simplify the search and update operations. Typical approximations of regions in space include minimum

    bounding rectangles (MBRs) [2, 9], minimum bounding circles (MBCs) [12, 23], and minimum bounding

    polygons (MBPs) [10]. Since MBR-based approximations are characterized by intermediate complexity

    and accuracy, they tend to be used more frequently than other kinds of approximations.

    In the area of spatial databases, Papadias et al. [18] adopted eight topological relations [4] to formally define spatial (topological) query predicates and to efficiently process spatial queries with R-tree variants. These are: equal, contains, inside, covers, covered_by, overlap, meet, and disjoint. These relations are cognitively identical to the eight binary relations of RCC-8 [19] and provide complete coverage of all possible (cognitively meaningful³) binary relations that can be defined by considering the interiors, boundaries, and complements of two regional objects [7, 11, 18, 20].⁴ The semantics of these topological relations make the relations pair-wise disjoint, i.e., no two relations refer to the same configuration of two spatial objects (e.g., overlap ≠ overlap ∨ equal). In [25], we redefined the semantics of topological relations as shown in Table 1(a) to facilitate the development of SAMs. Note that the set of configurations between two objects that constitute a relation rel1 could be a subset of the configurations constituting a relation rel2, which is denoted by rel1 ⊂ rel2. The following subset relationships can be inferred from Table 1(a): equal_to ⊂ covers, equal_to ⊂ covered_by, contains ⊂ covers, contained_by ⊂ covered_by, covers ⊂ overlaps, and covered_by ⊂ overlaps (note that overlaps ∩ meets = ∅ and not_disjoint = overlaps ∪ meets). The predicate contained_by (r, q), involving the relation contained_by and two spatial objects r and q, is read as “r is contained by q”. Similarly, meets (r, q) is simply read as “r meets q”, etc.

    2 Note that topological relations do not mean data relations (data tables or data sets) in the relational database model.

    Topological relation     Semantics of the relation
    equal_to (r, q)          All points of r are in the interior or on the boundary of q and vice versa.
    contains (r, q)          All points of q are in the interior of r.
    covers (r, q)            All points of q are in the interior or on the boundary of r.
    contained_by (r, q)      All points of r are in the interior of q.
    covered_by (r, q)        All points of r are in the interior or on the boundary of q.
    overlaps (r, q)          Some points of r are in the interior of q.
    meets (r, q)             Some points of r are on the boundary of q, but no point of r is in the interior of q.
    disjoint_with (r, q)     All points of r are in the exterior of q.

    (a)

    Query predicate          Selection predicate
    equal_to (r, q)          equal_to (r’, q’)
    contains (r, q)          contains (r’, q’)
    covers (r, q)            covers (r’, q’)
    contained_by (r, q)      contained_by (r’, q’)
    covered_by (r, q)        covered_by (r’, q’)
    overlaps (r, q)          overlaps (r’, q’)
    meets (r, q)             overlaps (r’, q’) ∨ meets (r’, q’)

    (b)

    Table 1. (a) Semantics of topological relations and (b) Topological relations that MBRs convey about objects
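    To make the semantics in Table 1(a) concrete, the sketch below evaluates the eight relations on axis-aligned rectangles with positive width and height. Restricting regions to such rectangles is an assumption made only to keep the example short, and the class `Rect` and the helper `_touch_or_intersect` are hypothetical names. Real regions would need exact geometry tests.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle with positive width and height (an assumption)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def covers(r, q):
    # All points of q are in the interior or on the boundary of r.
    return (r.xmin <= q.xmin and r.ymin <= q.ymin
            and q.xmax <= r.xmax and q.ymax <= r.ymax)


def contains(r, q):
    # All points of q are in the interior of r.
    return (r.xmin < q.xmin and r.ymin < q.ymin
            and q.xmax < r.xmax and q.ymax < r.ymax)


def covered_by(r, q):
    return covers(q, r)


def contained_by(r, q):
    return contains(q, r)


def equal_to(r, q):
    return covers(r, q) and covers(q, r)


def overlaps(r, q):
    # Some points of r are in the interior of q.
    return (r.xmin < q.xmax and q.xmin < r.xmax
            and r.ymin < q.ymax and q.ymin < r.ymax)


def _touch_or_intersect(r, q):
    # The closed rectangles share at least one point.
    return (r.xmin <= q.xmax and q.xmin <= r.xmax
            and r.ymin <= q.ymax and q.ymin <= r.ymax)


def meets(r, q):
    # Shared points exist, but none of them is in the interior of q.
    return _touch_or_intersect(r, q) and not overlaps(r, q)


def disjoint_with(r, q):
    return not _touch_or_intersect(r, q)


r = Rect(0, 0, 10, 10)
inner = Rect(2, 2, 4, 4)        # strictly inside r
corner = Rect(0, 0, 4, 4)       # inside r, sharing part of its boundary
neighbor = Rect(10, 0, 12, 5)   # shares only an edge segment with r
far = Rect(20, 20, 30, 30)      # no common point with r
print(contains(r, inner), covers(r, corner), meets(r, neighbor), disjoint_with(r, far))
```

    With these definitions the stated subset relationships hold on rectangles: any pair satisfying contains also satisfies covers, any pair satisfying covers also satisfies overlaps, and overlaps and meets never hold together.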

    Table 1(b) shows how the relations between actual objects, other than disjoint_with, are reflected in the relations between their MBRs.⁵ In the table, r and q denote actual spatial objects, while r’ and q’ represent their MBRs. For each row of the table, if the predicate of the first column involving two spatial objects is satisfied, then their MBRs must satisfy the predicate in the second column. The validity of this table was verified in [25].

    For a given query involving a topological relation, the database system first finds every object whose MBR satisfies the corresponding predicate in the second column of Table 1(b). For example, to answer the query “report every object whose region r contains the given query region q”, we must find every object whose MBR r’ contains the query window q’ (the MBR of q). In the refinement step, the actual figures corresponding to the selected objects are consulted to eliminate false hits (the refinement step is further discussed in Section 5). For this reason, a predicate in the first column of Table 1(b) is called a query predicate, while the corresponding predicate in the second column is referred to as a selection predicate.

    In recent years, many SAMs, such as the topological R*-tree [18] and our QSF-tree variants [16, 17, 25], have supported specialized search: for query predicates involving different topological relations, a different search predicate is used to test the entries in the upper levels of the tree. The experimental results presented in [16, 17, 18, 24, 25] show that specializing search operations to discriminate different topological relations while traversing the tree can result in a significant reduction in page accesses.

    Figure 2 summarizes this process of reporting every object r satisfying t(r, q), where t is a topological relation, and q is a query (reference) figure.

    We have developed a prototype of an advanced 2D GIS called the Relational Geographic Information System (RGIS). The client (the user interface) of the RGIS is our customized ArcView running on a Windows 2000/XP platform. This system is described in more detail in Section 4. The RGIS created test geographic databases and populated them with the ArcView themes stored in the shape files and the DB files [5, 6] provided by the Wyoming Geographic Information Science Center (WyGISC) at the University of Wyoming. Various queries were posed on the geographic databases generated by the RGIS.

    3 In the sense that people indeed distinguish between relations [18, 30].

    4 In addition to these predicates, nearest neighbor queries have been shown to be very useful and important in spatial databases, but we do not consider nearest neighbor queries in the proposed project.

    5 Spatial access methods that apply MBR approximations of objects are rather awkward in processing the relation disjoint_with. As noted in [28], this relation should be processed by a sequential scan of the leaf nodes in the structure.

    Figure 2. Finding every object r satisfying t(r, q)

    [Figure 2 labels: QUERY PREDICATE t (r, q); SELECTION PREDICATE t’ (r’, q’); SEARCH PREDICATE (dependent on the access method); REFINEMENT STEP (refine the candidate set); Select Data Objects; Search Index Structure]
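    The filter-and-refine process that Figure 2 summarizes can be sketched as follows for the query predicate contains(r, q). The sketch makes several assumptions that are not in the paper: actual regions are modeled as circles so that the exact test stays short, a linear scan stands in for the index search (so no search predicate appears), and the names `Circle`, `mbr`, `mbr_contains`, `exact_contains`, and `filter_and_refine` are hypothetical.

```python
import math
from typing import NamedTuple


class Circle(NamedTuple):
    """Stand-in for an actual region (an assumption made for brevity)."""
    cx: float
    cy: float
    radius: float


def mbr(c):
    # Minimum bounding rectangle as (xmin, ymin, xmax, ymax).
    return (c.cx - c.radius, c.cy - c.radius, c.cx + c.radius, c.cy + c.radius)


def mbr_contains(r, q):
    # Selection predicate contains(r', q'): q' lies in the interior of r'.
    return r[0] < q[0] and r[1] < q[1] and q[2] < r[2] and q[3] < r[3]


def exact_contains(r, q):
    # Query predicate contains(r, q) evaluated on the actual figures.
    return math.hypot(r.cx - q.cx, r.cy - q.cy) + q.radius < r.radius


def filter_and_refine(objects, q):
    window = mbr(q)
    # Filter step: a linear scan stands in for the index search.
    candidates = [name for name, c in objects.items()
                  if mbr_contains(mbr(c), window)]
    # Refinement step: consult the actual figures to eliminate false hits.
    answers = [name for name in candidates if exact_contains(objects[name], q)]
    return candidates, answers


# Made-up objects and query region, for illustration only.
objects = {"A": Circle(0, 0, 10), "B": Circle(5, 5, 4), "C": Circle(20, 20, 3)}
candidates, answers = filter_and_refine(objects, Circle(6, 6, 2))
print(candidates, answers)
```

    In this sample, object A passes the MBR test but fails the exact test, which is the kind of false hit the refinement step exists to remove.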

    Figures 3, 4, and 5 show some of the 2D topological queries supported by the RGIS. While the query regions in Figures 3 and 4 are existing objects, the query region in Figure 5 was drawn by the user.

    Figure 3. Query result of “report all cities in Wyoming”

    Figure 4. Query result of “report all cities that overlap the selected section of Interstate-90”

    Figure 5. Query result of “report all lakes that are contained by the given rectangle”

    3. COMPLEX TOPOLOGICAL QUERIES

    Thus far, we have discussed simple query predicates consisting of a simple topological predicate t(r, q).

    In practice, geographic queries often involve more complex topological predicates. A complex topological

    query predicate consists of multiple simple topological predicates. For example, the following queries

    involve complex topological query predicates (the underlined query predicates):

    Example 1. Report every landowner who owns land that overlaps q1 and covers q2.

    $\Pi_{land.landowner}(\sigma_{overlaps(land.geometry,\, q1) \wedge covers(land.geometry,\, q2)}(land))$

    Example 2. Report every lake that overlaps q1 or that contains both q2 and q3.

    $\sigma_{overlaps(lake.geometry,\, q1) \vee (contains(lake.geometry,\, q2) \wedge contains(lake.geometry,\, q3))}(lake)$

    Example 3. Report every landowner who owns land that overlaps q2 and that covers or meets any lake that contains q1.

    $\Pi_{land.landowner}(\sigma_{overlaps(land.geometry,\, q2) \wedge (covers(land.geometry,\, lake.geometry) \vee meets(land.geometry,\, lake.geometry)) \wedge contains(lake.geometry,\, q1)}(land \times lake))$

    $\Pi_{land.landowner}(\sigma_{overlaps(land.geometry,\, q2)}(land) \bowtie_{covers(land.geometry,\, lake.geometry) \vee meets(land.geometry,\, lake.geometry)} \sigma_{contains(lake.geometry,\, q1)}(lake))$

    In a conventional DBMS, a complex query predicate is normalized to the conjunctive normal form (CNF), which is a conjunction of terms. Each term is a simple query predicate or a disjunction of simple query predicates (disjuncts). A similar normal form, called conjunctive topological expression [7], was introduced for complex topological query predicates: by the definition of conjunctive topological expression [7], the predicates of conjunctions all have distinct reference objects.

    For these complex queries, conventional database query processing techniques generate possible execution plans (evaluation trees) that are logically the same based on the equivalence rules (e.g., $\sigma_{\theta_1}(\sigma_{\theta_2}(R)) = \sigma_{\theta_1 \wedge \theta_2}(R)$ and $\sigma_{\theta_1}(R) \cup \sigma_{\theta_2}(R) = \sigma_{\theta_1 \vee \theta_2}(R)$) and choose the most efficient one as the final execution plan. The final execution plan is chosen based on either a set of heuristics or estimated evaluation cost in terms of disk accesses and CPU time, or both [22]. For processing complex operations involving complex topological query predicates, the query transformation technique introduced in Section 2 (Figure 2) can be used in a straightforward manner. That is, a complex topological query predicate can be transformed into a selection predicate by converting every simple topological predicate into the corresponding selection predicate as shown in Table 1(b). For example, the complex query predicate overlaps (lake.geometry, q1) ∨ (contains (lake.geometry, q2) ∧ contains (lake.geometry, q3)) is mapped to the selection predicate overlaps (lake.geometry’, q1’) ∨ (contains (lake.geometry’, q2’) ∧ contains (lake.geometry’, q3’)).

    For more efficient query processing, we have developed a verification-optimization technique for complex topological query predicates. This technique, introduced in Sections 3.1 and 3.2, is, in part, a logical extension of the semantic query optimization technique in [18].
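    The predicate-by-predicate mapping from a complex query predicate to its selection predicate, described earlier in this section, can be sketched as a small recursive rewrite. The nested-tuple encoding of predicates, the trailing apostrophe used to mark MBRs, and the function name `to_selection` are assumptions chosen for this illustration. The mapping itself follows Table 1(b), where only meets changes form.

```python
# Relations listed in Table 1(b); disjoint_with is deliberately absent.
MAPPED = {"equal_to", "contains", "covers", "contained_by",
          "covered_by", "overlaps", "meets"}


def to_selection(pred):
    """Rewrite a query predicate tree into the selection predicate on MBRs.

    A simple predicate is (relation, object, reference); a complex one is
    ("and", p1, p2, ...) or ("or", p1, p2, ...).
    """
    op = pred[0]
    if op in ("and", "or"):
        return (op,) + tuple(to_selection(p) for p in pred[1:])
    relation, obj, ref = pred
    if relation not in MAPPED:
        raise ValueError(f"no selection predicate listed for {relation}")
    primed = (obj + "'", ref + "'")   # r' and q' denote the MBRs of r and q
    if relation == "meets":
        # meets(r, q) maps to overlaps(r', q') OR meets(r', q')
        return ("or", ("overlaps",) + primed, ("meets",) + primed)
    return (relation,) + primed


# Query predicate of Example 2 in this section.
query = ("or",
         ("overlaps", "lake.geometry", "q1"),
         ("and",
          ("contains", "lake.geometry", "q2"),
          ("contains", "lake.geometry", "q3")))
print(to_selection(query))
```

    The connectives are left untouched, so the selection predicate has the same shape as the query predicate except where a meets predicate expands into a disjunction.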

    3.1 Complex Topological Selections

    The verification-optimization step for complex topological selections consists of the following two ideas:

    (1) Elimination of Contradictory Conjunctions:

    The query predicate covered_by(r, qi) ∧ covers(r, qj) is a contradictory conjunction if qi and qj do not satisfy covers(qi, qj). That is, if qi does not cover qj, then no region r that is covered_by qi can cover qj. All possible contradictory conjunctions have been reported in [18]. We defined the equivalent set (Table 2) of all possible contradictory conjunctions based on the semantics of the topological relations in Section 2.

    Table 2.⁶ Conjunctions of query predicates yielding empty results

    Every condition in the table applies to the pair (q1, q2); for example, ¬ctb stands for ¬ctb(q1, q2). Column headings use the same abbreviations, so ct (r, q2) stands for contains (r, q2).

                          e (r, q2)       ct (r, q2)      cv (r, q2)      ctb (r, q2)     cvb (r, q2)     o (r, q2)       m (r, q2)
    equal_to (r, q1)      ¬e              ¬ct             ¬cv             ¬ctb            ¬cvb            ¬o              ¬m
    contains (r, q1)      ¬ctb            ∅               ∅               ¬ctb            ¬ctb            ∅               o ∨ m
    covers (r, q1)        ¬cvb            ∅               ∅               ¬ctb            ¬cvb            ∅               o
    contained_by (r, q1)  ¬ct             ¬ct             ¬ct             ¬o              ¬o              ¬o              ¬o ∨ cvb
    covered_by (r, q1)    ¬cv             ¬ct             ¬cv             ¬o              ¬o              ¬o              ¬(o ∨ m) ∨ cvb
    overlaps (r, q1)      ¬o              ∅               ∅               ¬o              ¬o              ∅               cvb
    meets (r, q1)         ¬m              o ∨ m           o               ¬o ∨ cv         ¬(o ∨ m) ∨ cv   cv              ctb ∨ ct

    6 This table is derived from Table 4 in [17, p. 100] with the topological relations in Section 3. In the table, e, ct, cv, ctb, cvb, o, and m stand for equal_to, contains, covers, contained_by, covered_by, overlaps, and meets, respectively.

    Given a complex selection, the verification-optimization module tests every possible conjunctive pair of two simple topological predicates Pi and Pj such that i ≠ j. If Pi and Pj meet a contradiction condition in Table 2, one of the following optimizations can be done:

    - IF Pi is a term and Pj is a term THEN the result of the selection is an empty set. Note that, in CNF, if there is any pair of contradictory terms in a given complex selection predicate, the result of the selection is empty (that is, the system does not process the selection, but simply returns an empty result set).

    - ELSE IF Pi is a term and Pj is a disjunct in a term THEN eliminate the predicate Pj from the predicate.

    - ELSE IF Pi is a disjunct in a term and Pj is a term THEN eliminate the predicate Pi from the predicate.

    - ELSE IF Pi is a disjunct in a term and Pj is a disjunct in a term THEN eliminate either Pi or Pj, ideally the one that is more expensive to evaluate (in a simple heuristic, the one whose reference region is larger is eliminated).
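    A minimal sketch of this verification step is given below, under several assumptions that go beyond the text: the CNF is a list of terms, each term is a list of (relation, reference-region name) pairs, regions are axis-aligned rectangles, and only the one contradiction rule spelled out above (covered_by with covers) is encoded instead of the full Table 2. The names `contradictory` and `verify` are hypothetical. The case of two contradictory disjuncts is deliberately left untouched in this sketch.

```python
from itertools import permutations


def covers(a, b):
    # a and b are rectangles (xmin, ymin, xmax, ymax); b lies within a.
    return a[0] <= b[0] and a[1] <= b[1] and b[2] <= a[2] and b[3] <= a[3]


def contradictory(pi, pj, regions):
    # Only the rule stated in the text is encoded here:
    # covered_by(r, qi) AND covers(r, qj) is contradictory unless covers(qi, qj).
    (ti, qi), (tj, qj) = pi, pj
    if ti == "covered_by" and tj == "covers":
        return not covers(regions[qi], regions[qj])
    return False


def verify(cnf, regions):
    """Return None if the selection is provably empty, else the pruned CNF."""
    doomed = set()  # (term index, predicate) pairs to eliminate
    for i, j in permutations(range(len(cnf)), 2):
        for pi in cnf[i]:
            for pj in cnf[j]:
                if not contradictory(pi, pj, regions):
                    continue
                i_is_term = len(cnf[i]) == 1
                j_is_term = len(cnf[j]) == 1
                if i_is_term and j_is_term:
                    return None              # two contradictory terms
                if i_is_term:
                    doomed.add((j, pj))      # Pj is a disjunct: drop it
                elif j_is_term:
                    doomed.add((i, pi))      # Pi is a disjunct: drop it
                # two disjuncts: left untouched in this sketch
    pruned = [[p for p in term if (k, p) not in doomed]
              for k, term in enumerate(cnf)]
    if any(not term for term in pruned):
        return None  # every disjunct of some term was eliminated
    return pruned


# Made-up regions: q1 covers q3 but not q2.
regions = {"q1": (0, 0, 4, 4), "q2": (10, 10, 12, 12), "q3": (1, 1, 2, 2)}
# covered_by(r, q1) AND (covers(r, q2) OR covers(r, q3))
cnf = [[("covered_by", "q1")], [("covers", "q2"), ("covers", "q3")]]
print(verify(cnf, regions))
```

    The sketch makes a single pass; a fuller version might repeat the check after pruning, since a term reduced to one disjunct can then contradict another term.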

    (2) Simplification of Conjunctions and Disjunctions:

    In our definitions, the topological relations have subset relationships as shown in Section 2. Therefore, if a conjunction of ti(r, qi) and tj(r, qj) satisfies equal_to(qi, qj) and ti ⊆ tj, every r that satisfies ti with respect to qi satisfies tj with respect to qj. Thus, the predicate tj(r, qj) can be eliminated. Similarly, the disjunction of ti(r, qi) and tj(r, qj) can be simplified to tj(r, qj) if equal_to(qi, qj) ∧ ti ⊆ tj.
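    The simplification rule can be sketched with the subset relationships listed in Section 2. In the sketch below, the dictionary `DIRECT`, the functions `is_subrelation` and `simplify`, and the flag `same_region` (standing for an already evaluated equal_to(qi, qj)) are hypothetical names introduced for illustration.

```python
# Direct subset relationships from Section 2 (rel1 -> relations it is a subset of).
DIRECT = {
    "equal_to": {"covers", "covered_by"},
    "contains": {"covers"},
    "contained_by": {"covered_by"},
    "covers": {"overlaps"},
    "covered_by": {"overlaps"},
}


def is_subrelation(ti, tj):
    """True if ti is tj or a subset of tj (following the listed chains)."""
    if ti == tj:
        return True
    return any(is_subrelation(mid, tj) for mid in DIRECT.get(ti, ()))


def simplify(connective, pi, pj, same_region):
    """Simplify pi AND pj, or pi OR pj; predicates are (relation, reference)."""
    ti, tj = pi[0], pj[0]
    if same_region and is_subrelation(ti, tj):
        # Conjunction keeps the stricter ti; disjunction keeps the broader tj.
        return [pi] if connective == "and" else [pj]
    if same_region and is_subrelation(tj, ti):
        return [pj] if connective == "and" else [pi]
    return [pi, pj]   # nothing to simplify


p_contains = ("contains", "q1")
p_overlaps = ("overlaps", "q2")
print(simplify("and", p_contains, p_overlaps, same_region=True))
print(simplify("or", p_contains, p_overlaps, same_region=True))
```

    When the reference regions differ, or when neither relation is a subset of the other (for example contains and meets), both predicates are kept.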

    3.2 Complex Topological Theta Joins

    Like complex topological selections, complex topological theta joins can have multiple conjunctions and disjunctions of query predicates. But, unlike the selections, all the query predicates constituting the complex condition of a join have the same reference object s.

    The complete set of contradictory conjunctions of join conditions and the simplification rules for conjunctions and disjunctions of join conditions can be easily derived from the corresponding set and algorithm in Section 3.1 by replacing the query regions q1 and q2 with s as shown in Table 3. Also, as discussed in item (2) of Section 3.1, if a conjunction of ti(r, s) and tj(r, s) satisfies ti ⊆ tj, every r that satisfies ti with respect to s satisfies tj with respect to s. Thus, the predicate tj(r, s) can be eliminated. Similarly, the disjunction of ti(r, s) and tj(r, s) can be simplified to tj(r, s) if ti ⊆ tj.

    Table 3. Conjunctions of query predicates that yield empty results when their reference objects are the same (O: Yield Empty Result, X: Do Not Yield Empty Result)

                          equal_to (r, s)      contains (r, s)      covers (r, s)        contained_by (r, s)  covered_by (r, s)    overlaps (r, s)      meets (r, s)
    equal_to (r, s)       X                    O                    X                    O                    X                    X                    O
    contains (r, s)       O                    X                    X                    O                    O                    X                    O
    covers (r, s)         X                    X                    X                    O                    X                    X                    O
    contained_by (r, s)   O                    O                    O                    X                    X                    X                    O
    covered_by (r, s)     X                    O                    X                    X                    X                    X                    O
    overlaps (r, s)       X                    X                    X                    X                    X                    X                    O
    meets (r, s)          O                    O                    O                    O                    O                    O                    X
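    Because all predicates of a join condition share the reference object s, Table 3 can be used as a plain lookup. The sketch below encodes the table row by row and checks a conjunction of join predicates for a pair that yields an empty result. The encoding as strings and the names `TABLE3`, `yields_empty`, and `join_condition_is_empty` are assumptions made for illustration.

```python
from itertools import combinations

RELATIONS = ["equal_to", "contains", "covers", "contained_by",
             "covered_by", "overlaps", "meets"]

# Rows and columns follow RELATIONS; "O" means the conjunction is empty.
TABLE3 = [
    "XOXOXXO",  # equal_to
    "OXXOOXO",  # contains
    "XXXOXXO",  # covers
    "OOOXXXO",  # contained_by
    "XOXXXXO",  # covered_by
    "XXXXXXO",  # overlaps
    "OOOOOOX",  # meets
]


def yields_empty(ti, tj):
    """True if ti(r, s) AND tj(r, s) can never be satisfied."""
    return TABLE3[RELATIONS.index(ti)][RELATIONS.index(tj)] == "O"


def join_condition_is_empty(conjuncts):
    """Check every pair of conjoined relations against Table 3."""
    return any(yields_empty(a, b) for a, b in combinations(conjuncts, 2))


print(yields_empty("contains", "covered_by"))
print(join_condition_is_empty(["covers", "covered_by", "overlaps"]))
```

    A join whose condition is flagged this way does not need to be evaluated at all, mirroring the empty-result shortcut for selections.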

    4. THE RGIS AND EXPERIMENTAL RESULTS

    This section introduces the first prototype of our geographic information system called the relational geographic information system (RGIS) and presents some of our experimental results showing the query performance of the RGIS.

    4.1 The RGIS: System Architecture

    We have developed a GIS called the Relational Geographic Information System (RGIS).⁷ The current RGIS prototype is a three-tier system (see Figure 6). The middle-ware server connects the databases to the client system. The client’s user interface is a customized ArcView equipped with our additional buttons, dialogs, menus, and some extended modules. The extended modules are the Extractor Module, the Query Module, and the Drawing Module. The Extractor Module transfers GIS data from the client to the server by reading data from ArcView or ArcInfo’s internal format and sending it to our DLL (dynamic-link library). The Query Module allows end-users to visually make complex queries (in the current system, complex 2D queries). The users can draw arbitrary query regions or choose some of the existing data objects as query regions. A query is sent to the server and the query result is stored in ArcView’s internal format. This allows the user to view and modify the data using ArcView. The Drawing Module displays the query results in the map view format for spatial data and in tabular format for non-spatial data.

    [Figure 6 labels: ArcView; Extractor Module; Query Module; Drawing Module; Socket Client DLL; DLL Call Interface; Client; Socket Server; Relational DBMS; Query Processor; Server Interface Subroutines; Spatial Query Optimizer; Middle-Ware Server; Geographic Databases; Database Server]

    Figure 6. The Architecture of the first RGIS prototype

    The architecture design shown in Figure 6 allows for the replacement of the client with a different user interface such as a web-browser-based application. For example, a web-browser-based interface could be distributed to run in a view-only mode for users who only need to view GIS data and not update it. (We are currently developing this type of user interface in Java and HTML.)

    7 Funded by Informix Software Grant (2001) and NSF EPSCoR Starter Grant (2001, NSF EPSCoR NSFLOC4304).

    The server-side components of the RGIS are implemented in Linux. The Spatial Query Optimizer is based on the techniques introduced in Section 3 and some additional heuristics. The Query Processor handles query evaluation plans and other client requests through ESQL/C-based modules. The Query Processor is the bridge between the rest of the system and the underlying Database Server. The prototype uses an Informix Universal Server 2000 (a commercial database management system) running on a Linux machine, but any relational or object-relational database management system (DBMS) that supports dynamic or embedded SQL could be used, since the design uses standard SQL-92 syntax where possible. However, a few minor changes may be needed to support vendor-specific database management commands. We are currently developing the query processor for a MySQL DBMS.

    The RGIS prototype is equipped with a generic GIS schema called the Generic Relational-

    Geographic-Information Schema (GRGIS) and a schema generator called the automatic Schema

    Generation mechanism for GIS applications (SGGIS) (the details of GRGIS and SGGIS can be found in

    the Appendix). The GRGIS can be automatically specialized by the SGGIS to accommodate any GIS dataset

    whose data objects are represented based on 2D geometry with linear interpolation between vertices. The

    GRGIS and SGGIS are based on the relational database model and fully compatible with the OpenGIS

    Features Specification8 for SQL developed by OGC (OpenGIS Consortium) [13]. The GRGIS and SGGIS

    facilitate the migration and deployment of GIS data in well-established relational database environments.

    Consequently, sharing and integrating GIS data become much more feasible. In addition, since any

    DBMS that supports the relational database model and SQL can be used, developing a GIS application

    system on existing GIS data is facilitated.

    4.2 Experimental Results

    We have run several sample queries through the RGIS. The server machine was a Linux workstation

    equipped with an 800 MHz Intel Pentium III and an EIDE Maxtor 52049H4 20419 MB hard drive. The first

    query, described in Table 4, shows a clear advantage of the verification-optimization steps. In this

    sample query, a simple predicate that queries for ZIP codes in the state of Colorado was “anded” with a

    disjunction of three simple topological predicates whose query (reference) regions are the Platte river,

    the Colorado river, and the Arkansas river, respectively.

    Table 4 shows that the optimizer can save time by catching inadvertent spatial predicates that

    should not be in the expression. The following queries are some of the other queries we ran through the

    RGIS:

    8 In May 1999, OGC (OpenGIS Consortium) released the OpenGIS Simple Features Specification for SQL [5]. The purpose of this specification is to define a standard SQL schema that supports storage, retrieval, query and update of a collection of geospatial features. In the specification, geometric features are represented based on 2D geometry with linear interpolation between vertices. The specification also gives the definitions of spatial relations based on the interiors, exteriors, and the boundaries of participating objects.


    Query 2: Select the ZIP codes that overlap Pecos river, Brazos river, Red river, or Canadian river

    and that are covered by the state of Texas.

    Query 3: Select the ZIP codes that overlap Mississippi river, Missouri river, Illinois river,

    Tennessee river, or Ohio river and that are covered by the state of Illinois, Indiana, or

    Kentucky.

    Query 4: Select all Colorado ZIP codes that overlap Platte river, Colorado river, or Arkansas river

    or that are covered by Colorado river or Arkansas river.

    Table 4. Query 1

    DESCRIPTION
        Select all ZIP codes that overlap the Platte river9 (P), Colorado river (C), or
        Arkansas river (A) and that belong to the state of Colorado (CS).
        σ_{covered_by(ZIP.geometry,CS) ∧ (overlaps(ZIP.geometry,P) ∨ overlaps(ZIP.geometry,C) ∨ overlaps(ZIP.geometry,A))}(ZIP)

    STATISTICS
        The MBR size of Colorado State:   28.1800
        River MBRs (total size):          158.1072
            Platte MBR:                   3.95
            Colorado MBR:                 68.79
            Arkansas MBR:                 85.37

    TURNAROUND TIME
        without the verification-optimization steps:   0.62 seconds
        with the verification-optimization steps:      0.56 seconds

    EXECUTION NOTES
        The contradictory conjunction covered_by(ZIP,CS) ∧ overlaps(ZIP,P) was found, and the
        overlaps(ZIP,P) predicate was eliminated.
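    The elimination reported in the execution notes of Table 4 can be illustrated with a minimal Python sketch. It assumes that a contradiction is detected by comparing minimum bounding rectangles (MBRs): an object covered by CS lies inside the MBR of CS, so it cannot overlap a region whose MBR is disjoint from that MBR. The actual verification rules are those of Section 3; the function names and all coordinates below are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, NamedTuple


class MBR(NamedTuple):
    """Axis-aligned minimum bounding rectangle."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def mbrs_disjoint(a: MBR, b: MBR) -> bool:
    """Return True when the two rectangles share no point."""
    return (a.xmax < b.xmin or b.xmax < a.xmin or
            a.ymax < b.ymin or b.ymax < a.ymin)


def prune_overlaps(container: MBR, references: Dict[str, MBR]) -> List[str]:
    """Keep only the reference regions that an object covered by
    `container` could possibly overlap (hypothetical helper)."""
    return [name for name, box in references.items()
            if not mbrs_disjoint(container, box)]


# Hypothetical (longitude, latitude) rectangles, for illustration only.
colorado = MBR(-109.05, 37.0, -102.05, 41.0)
rivers = {
    "P": MBR(-100.8, 40.5, -95.9, 41.4),   # lies entirely east of Colorado
    "C": MBR(-114.8, 31.8, -105.8, 40.5),
    "A": MBR(-106.4, 33.8, -91.1, 39.3),
}

# overlaps(ZIP, P) contradicts covered_by(ZIP, CS), so P is dropped
# from the disjunction before any spatial search is issued.
remaining = prune_overlaps(colorado, rivers)
print(remaining)
```

    With these assumed rectangles only the predicates on C and A remain, which mirrors the simplification described in Table 4.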

    Figure 7 shows the performance of the RGIS for these given queries.

    5. CONCLUSIONS

    Just a few decades ago, paper maps were the principal means of synthesizing and representing geographic

    information. Paper maps are limited to manual manipulation and fail to meet the increasing demand for

    interactive manipulation and analysis of geographic data. The rapid development of new computer

    software and hardware technologies has made meeting this demand possible: various types of geographic

    information systems (GIS’s) that can replace traditional paper maps have been developed. In recent years,

    a GIS has become more than a cartographic tool to produce digital maps. A GIS provides storage,

    management, and retrieval of geographic spatial data (e.g., the boundaries of lakes) and related

    non-spatial data (e.g., names, sizes, and average water temperatures of lakes). The GIS application domain

    spans many areas including Urban Planning, Route Optimization, Public Utility Network Management,

    Demography, Cartography, Agriculture, Natural Resources Management, Coastal Monitoring, Fire

    Control, and Epidemic Monitoring.

    9 The Platte river in the query is not the North Platte river or the South Platte river but the Platte river that starts in Nebraska. The South Platte river originates as snowmelt in central Colorado. From its source, the river flows southeastward, then north-northeastward, and, after crossing the Colorado-Nebraska border, joins the North Platte river that flows from north central Colorado. The Platte River originates at the confluence of the North Platte and South Platte rivers near North Platte, Nebraska.

    [Figure 7: plot of turnaround time (sec), axis ticks from 0.00 to 7.00, for Query 1, Query 2, Query 3, and Query 4]

    Figure 7. The query performance of the RGIS

    In practical scientific or analytical GIS applications, queries are often complex, and there is an

    increasing demand for database-supported GIS’s that are streamlined for handling complex statistical or

    analytical queries. Therefore, processing complex topological queries that involve multiple query regions

    and topological relations over existing GIS data is required. In most GIS’s, users frequently execute

    multiple individual operations recursively to collect or analyze data due to the marked lack of an adequate

    query processor for complex spatial queries.

    The paper presented the design, implementation, and test of an advanced processing technique for

    complex topological queries. This technique can improve the performance and functionality of numerous

    spatial applications. More specifically, the main contributions of this research include the following:

    - The RGIS: an advanced, customizable, academic testbed for GIS related research.

    - An efficient processing technique for simple topological queries: represents the state of the art

    and improves the performance of a majority of GIS’s.

    - Support for complex topological queries: improves the performance and functionality of a

    majority of GIS’s.

    Our RGIS technology (including not only the query processing techniques introduced in this

    paper but also the automated GIS database schema generation that can be found in the Appendix) represents

    an elaborate accumulation of many years of research. Nevertheless, we believe that there is still

    significant room for improvement.


    First, the refinement step verifies the topological relationship of actual regions instead of their

    MBRs. The seven topological relations in the first column of Table 1(b)10 can be answered with one of the

    more general computational geometry algorithms that compute overlap of two polygons. Unfortunately,

    the polygon intersection problem is complex and very CPU-intensive. The most common approach is to

    subdivide the polygon into simpler polygons for which the above tests can be performed in a very efficient

    manner. Polygon triangulation, clipping, trapezoidal decomposition, and subdivision into simple polygons

    are some of the options. In some cases it is necessary to pre-compute a data structure to compute overlap

    efficiently. For example, once a trapezoidal decomposition of a polygon is given, it is easy to determine

    whether another polygon intersects it. However, constructing such a decomposition is quite expensive: it requires

    O(n^2) space, which might be prohibitively expensive for complex geographical regions. We propose to

    investigate the tradeoffs between computing the overlap on the fly and pre-computing more efficient data

    structures and saving them in a database for later retrieval. In our future work, we will adapt some of the

    classical algorithms [3], as well as some recently proposed heuristics to compute the overlap [1].
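    The "on the fly" side of this tradeoff can be sketched as a filter step on MBRs followed by a brute-force refinement step. The Python code below is only an illustration under simplifying assumptions: polygons are simple, have no holes, and are given as vertex lists without a repeated closing vertex. It is not the refinement algorithm of the RGIS, and it uses a pairwise edge test rather than a trapezoidal decomposition; all function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # simple polygon, vertices in order, not closed explicitly


def mbr(poly: Polygon) -> Tuple[float, float, float, float]:
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def mbrs_intersect(a, b) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def orient(p: Point, q: Point, r: Point) -> int:
    """Sign of the cross product (q - p) x (r - p)."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def on_segment(p: Point, q: Point, r: Point) -> bool:
    """For collinear p, q, r: does r lie within the extent of segment pq?"""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0]) and
            min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(p1: Point, p2: Point, p3: Point, p4: Point) -> bool:
    o1, o2 = orient(p1, p2, p3), orient(p1, p2, p4)
    o3, o4 = orient(p3, p4, p1), orient(p3, p4, p2)
    if o1 != o2 and o3 != o4:
        return True
    return ((o1 == 0 and on_segment(p1, p2, p3)) or
            (o2 == 0 and on_segment(p1, p2, p4)) or
            (o3 == 0 and on_segment(p3, p4, p1)) or
            (o4 == 0 and on_segment(p3, p4, p2)))


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray casting; points exactly on the boundary get no special handling."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def edges(poly: Polygon):
    n = len(poly)
    return [(poly[i], poly[(i + 1) % n]) for i in range(n)]


def polygons_intersect(a: Polygon, b: Polygon) -> bool:
    # Filter step: disjoint MBRs imply disjoint regions.
    if not mbrs_intersect(mbr(a), mbr(b)):
        return False
    # Refinement step: look for any pair of crossing boundary edges.
    for p1, p2 in edges(a):
        for p3, p4 in edges(b):
            if segments_intersect(p1, p2, p3, p4):
                return True
    # No boundary crossing: one region may still contain the other.
    return point_in_polygon(a[0], b) or point_in_polygon(b[0], a)
```

    The refinement loop costs time proportional to the product of the two vertex counts for every candidate pair, which is the per-query cost that a pre-computed structure would trade against additional storage.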

    Second, complex topological queries can be further optimized. Efficient approximation and

    maintenance of the spatial statistics of a spatial database is still an area of ongoing research. Moreover, the cost

    models for the performance estimation of multidimensional access methods tend to be much more

    complex (and, as a result, require more time for evaluation) than the cost models for conventional

    one-dimensional access methods, such as B+-trees and hashing. However, as these research areas become

    mature, the idea of cost-based optimization of complex topological queries will be much more feasible.

    10 In addition, one can consider recently proposed 3D topological relations [43] as well as the original Egenhofer’s topological relations [7] to possibly extend the set of topological relations.

    REFERENCES

    [1] W. Badawy and W.G. Aref, “On Local Heuristics to Speed Up Polygon-Polygon Intersection Tests,” Proc. of ACM-GIS the International Symposium on Advances in Geographic Information Systems, pp. 97-102, Kansas City, MO, November 1999.

    [2] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, “The R*-tree: An Efficient and Robust Access Method for Points and Rectangles,” Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 322-331, 1990.

    [3] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, “Computational Geometry: Algorithms and Applications,” Springer-Verlag, Berlin, 1997 (http://www.cs.ruu.nl/geobook/).

    [4] M.J. Egenhofer, “Reasoning about Binary Topological Relations,” O. Gunther and H.-J. Schek, editors, SSD International Symposium on Advances in Spatial Databases, LNCS Lecture Notes in Computer Science, pp. 143-160, Springer-Verlag, 1991.

    [5] Environmental Systems Research Institute, Inc., “ESRI Shapefile Technical Description,” ESRI White Technical Paper, July 1998 (http://www.esri.com).

    [6] Environmental Systems Research Institute, Inc., “Getting to Know ArcView GIS, 3rd Edition,” ESRI Press, Redlands, California, 1999.

    [7] M. Grigni, D. Papadias, and C. Papadimitriou, “Topological Inference,” Proc. IJCAI International Joint Conference on Artificial Intelligence, pp. 901-906, 1995.

    [8] O. Gunther and J. Bilmes, “Tree-Based Access Methods for Spatial Databases: Implementation and Performance Evaluation,” IEEE Trans. Knowledge and Data Engineering, 3(3):342-356, 1991.

    [9] A. Guttman, “R-trees: A Dynamic Index Structure for Spatial Searching,” Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 47-54, 1984.

    [10] I. Kamel and C. Faloutsos, “Hilbert R-tree: An Improved R-tree Using Fractals,” Proc. 20th Int. Conf. on Very Large Data Bases, pp. 500-509, 1994.

    [11] D. Mark and M. Egenhofer, “Calibrating the Meaning of Spatial Predicates from Natural Language: Line-Region Relations,” Proc. SDH International Symposium on Spatial Data Handling, pp. 538-553, 1994.

    [12] P. Oosterom, “Reactive Data Structures for Geographic Information Systems,” Ph.D. Thesis, University of Leiden, Netherlands, 1990.

    [13] Open GIS Consortium, “OpenGIS Simple Features Specification for SQL, Revision 1.1,” OpenGIS Project Document 99-049, May 5, 1999.

    [14] J. Orenstein and T.H. Merrett, “A Class of Data Structures for Associative Searching,” Proc. 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pp. 181-190, 1984.

    [15] R. Orlandic, “A High-Precision Spatial Access Method Based on a New Linear Representation of Quadtrees,” Proc. 1st Conf. on Information and Knowledge Management CIKM-92, pp. 499-508, 1992.

    [16] R. Orlandic and B. Yu, “A Study of MBR-Based Spatial Access Methods: How Well They Perform in High-Dimensional Spaces,” Proceedings IEEE IDEAS International Database Engineering and Applications Symposium, pp. 306-315, 2000.

    [17] R. Orlandic and B. Yu, “Scalable QSF-Trees: Retrieving Regional Objects in High-Dimensional Spaces,” JDM Journal of Database Management, Idea Group Publishing, accepted in 2003.

    [18] D. Papadias, Y. Theodoridis, T. Sellis, and M.J. Egenhofer, “Topological Relations in the World of Minimum Bounding Rectangles: A Study with R-trees,” Proc. of the ACM SIGMOD International Conference on Management of Data, pp. 92-103, 1995.

    [19] D. Randell, Z. Cui, and A. Cohn, “A Spatial Logic Based on Regions and Connection,” Proc. KR’92 Conferences on Principles of Knowledge Representation and Reasoning, pp. 165-176, 1992.

    [20] J. Renz and B. Nebel, “Spatial Reasoning with Topological Information,” C. Freksa, C. Habel, and K. Wender, editors, Spatial Cognition: An Interdisciplinary Approach to Representation and Processing of Spatial Knowledge, LNCS Lecture Notes in Computer Science, pp. 351-372, Springer-Verlag, 1998.

    [21] T. Sellis, N. Roussopoulos, and C. Faloutsos, “Multidimensional Access Methods: Trees Have Grown Everywhere,” Proc. VLDB Very Large Databases Conference, pp. 13-14, 1997.

    [22] A. Silberschatz, H. Korth, and S. Sudarshan, “Database System Concepts, Fourth Edition,” ISBN 0-07-228363-7, McGraw-Hill, New York, 2002.

    [23] D.A. White and R. Jain, “Similarity Indexing with the SS-tree,” Proc. 12th IEEE Int. Conf. on Data Engineering, pp. 516-523, 1996.

    [24] B. Yu and R. Orlandic, “Object and Query Transformation: Supporting Multi-Dimensional Queries in Transactional Systems,” Proceedings ACM CIKM International Conference on Information and Knowledge Management, pp. 141-149, 2000.

    [25] B. Yu, R. Orlandic, and M. Evens, “Simple QSF-trees: An Efficient and Scalable Spatial Access Method,” Proc. ACM CIKM International Conf. on Information and Knowledge Management, pp. 5-14, 1999.

    APPENDIX Relational Geographic Databases: The Schema Design

    Please note that this appendix is prepared for the reviewers. A shorter version (i.e., an extended summary) of this appendix was published in the proceedings of IIIS SCI 2003 (the paper is also available on-line at www.cs.uwyo.edu/~yu/GIS/).

    This appendix proposes a generic relational-database schema that can efficiently accommodate various types of GIS data. The proposed schema complies with the OpenGIS Simple Features Specification for SQL developed by OGC (OpenGIS Consortium) and can be used for any geographic application whose geographic objects are represented based on 2D geometry with linear interpolation between vertices. The generic schema that we propose in this appendix makes it possible to automatically generate a relational database schema for any existing or new 2D GIS dataset. This facilitates the migration and deployment of GIS data in well-established relational database environments. Consequently, sharing and integrating GIS data become much more feasible. In addition, since any relational database management system (DBMS) can be used, developing a GIS application system on existing GIS data is facilitated. We verified the proposed schema and automatic schema generation mechanism by developing and testing a relational geographic information system.

    A.1 Introduction

    Just a few decades ago, paper maps were the principal means of synthesizing and representing geographic information. Paper maps are limited to manual manipulation and fail to meet the increasing demand for interactive manipulation and analysis of geographic data. The rapid development of new computer software and hardware technologies has made meeting this demand possible: various types of geographic information systems (GIS’s) that can replace traditional paper maps have been developed. In recent years, a GIS has become more than a cartographic tool to produce digital maps. A GIS provides storage, management, and retrieval of geographic spatial data (e.g., the boundaries of lakes) and related non-spatial data (e.g., names, sizes, and average water temperatures of lakes). The GIS application domain spans many areas including Urban Planning, Route Optimization, Public Utility Network Management, Demography, Cartography, Agriculture, Natural Resources Management, Coastal Monitoring, Fire Control, and Epidemic Monitoring. In recent years, there has been an increasing demand for database-supported GIS’s that are streamlined for handling complex statistical or analytical queries.

    Most modern GIS’s have been developed based on a file system. As a result, each GIS has its own logical data formats and file structures. Unfortunately, these file-system-based GIS’s have several well-known problems that have been identified in the area of databases: data sharing, data redundancy and inconsistency, transaction control and recovery, concurrency control, and security. The most feasible approach to these problems is building a GIS based on a well-established database model [6]. Perhaps the most successful database model that has been proven to address the problems of early file systems in a reliable manner is the relational database model: almost all database management systems (DBMS’s) support the relational database model.

    Previous works related to developing a GIS based on database technology are categorized into two approaches: the hybrid approach [15, 16] and the integration approach [12, 13, 14, 21, 22]. The hybrid approach uses a DBMS to store and manage non-spatial data, and spatial data is separately managed by either a proprietary file system (e.g., ARC/INFO) [15] or a spatial data manager (e.g., Papyrus) [16]. On the other hand, the integration approach extends the ER-model (the relational database model) by adding new data types and operations to capture spatial semantics [12, 13, 21, 22] and requires the DBMS to support user-defined ADTs (Abstract Data Types) and operations [14, 21, 22]. In these systems, the major problem is data sharing and migration among heterogeneous systems.

    This appendix proposes a highly flexible and portable GIS schema called the Generic Relational-Geographic-Information Schema (GRGIS) that can be automatically specialized to accommodate any GIS dataset whose data


    objects are represented based on 2D geometry with linear interpolation between vertices. The schema is based on the relational database model and a widely-used standard SQL (SQL92) [6] and is fully compatible with the OpenGIS Simple Features Specification for SQL developed by OGC (OpenGIS Consortium) [5]. This appendix also proposes our technique called the automatic Schema Generation mechanism for GIS applications (SGGIS) that can automatically generate a relational database schema, given a GIS dataset. The GRGIS and SGGIS facilitate the migration and deployment of GIS data in well-established relational database environments. Consequently, sharing and integrating GIS data become much more feasible. In addition, since any database management system (DBMS) that supports the basic relational database model and SQL can be used, developing a GIS application system on an existing DBMS and reusing existing sets of geographic data are facilitated. An experimental GIS system called the RGIS (Relational Geographic Information System) has been developed to verify the GRGIS and the SGGIS.

    The remainder of this appendix is organized as follows: Section A.2 gives an overview of commonly used GIS data models. Section A.3 introduces the GRGIS and SGGIS. Section A.4 shows our experimental results. Finally, we provide the summary and discuss our future work in Section A.5.

    A.2 Background

    In many ways, a GIS presents a simplified view of the real world. Each geographic data object is associated with two kinds of data: non-spatial data and spatial data. Non-spatial data of a geographic object consists of alphanumeric values describing (or being associated with) the object. Spatial data of a geographic object represents geometric properties of the object. Geometric properties of a geographic object define the geometry (i.e., geometric figure) of the object by defining the interiors, the boundaries, and the exteriors of the object [8].

    Existing spatial data models can be classified into two groups depending on how they view the real world: the field model and the object model [1, 2, 3, 4]. The field model views the world as a continuous surface over which features vary in a continuous distribution (e.g., atmospheric pressure). In this model, the world (i.e., a field) is partitioned into areas, and the emphasis is on the contents of these areas. The object model thinks of the world as a surface littered with recognizable objects. Another classification of spatial data is based on the representation of spatial data: raster representation and vector representation. Typically the field model is developed based on the raster representation. The vector representation, on which the object model is implemented, explicitly stores the geometric features of the identified geographic objects (typically obtained from raster data). It takes much less storage space and provides efficient geometrical and topological operations. Although the field model is still used in some applications such as atmosphere GIS applications and environmental GIS applications, the object model is becoming widely accepted because geometrical and topological operations are necessities for an increasing number of emerging GIS applications. In this appendix, we focus on the object model and the vector representation.

    A.2.1 Object Models based on Vector Representation

    In the vector representation, objects are constructed from points as primitives. A point is represented by a pair of X, Y coordinates, whereas more complex linear and region objects are represented by structures (lists, sets) on their point representation. For collections of geographic objects, the representation of topological relationships among the objects is also of interest. Depending on how topological relationships are expressed, representations of geographic-object collections are usually classified into two models: spaghetti representation and topological representation. In the spaghetti representation, the geometric properties of any spatial object are described independently of other objects. No topological relations are stored, and all topological relations are computed on demand. On the other hand, the topological representation describes geometrical properties in terms of node, arc, polygon, region, and the topological relations among them. For example, a node is represented by a point and a list of arcs starting (or ending) at this node; an arc is represented by its ending nodes and the polygons having the arc as a common boundary; a polygon is represented by a list of arcs.

    The main advantage of the spaghetti representation is its simplicity. The drawbacks of this model are mainly due to the lack of explicit information about topological relations among spatial objects. In addition, the spaghetti representation implies data redundancy. For example, the coordinate values representing a boundary shared by two adjacent regions are duplicated.

    The topological representation can efficiently support some topological queries. For example, looking for polygons adjacent to a given polygon P1 is straightforward. P1 is scanned, and accessing each of its arcs provides a polygon adjacent to P1. However, if P1 is not one of the data objects, this does not work, and such topological queries can be better supported by spatial access methods (e.g., topological R*-trees [9] and QSF-trees [10]). In fact, spatial access methods can be more easily built on the datasets represented in the spaghetti representation. Moreover, the topological representation is slower at generating query results for display, because each object’s actual coordinate values are replaced by other objects’ identifications. That is, a larger number of objects must be accessed. This results in an increased number of disk accesses. Due to these facts, for GIS’s dealing with a large volume of geographic data, the spaghetti representation is preferable.
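    The two properties of the spaghetti representation noted above, duplicated coordinates on shared boundaries and adjacency computed on demand, can be made concrete with a small Python sketch. The region names and coordinates are hypothetical, and the adjacency test is a simplification that only recognizes boundaries stored with identical endpoints in both regions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Spaghetti representation: every region stores its own full vertex list,
# independently of all other regions (hypothetical sample data).
regions: Dict[str, List[Point]] = {
    "R1": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "R2": [(2, 0), (4, 0), (4, 2), (2, 2)],  # shares edge (2,0)-(2,2) with R1
    "R3": [(5, 0), (6, 0), (6, 1), (5, 1)],
}


def edge_set(ring: List[Point]) -> set:
    """Undirected edges of a closed ring."""
    n = len(ring)
    return {frozenset((ring[i], ring[(i + 1) % n])) for i in range(n)}


def adjacent_regions(name: str, data: Dict[str, List[Point]]) -> List[str]:
    """Regions sharing at least one identical edge with `name`.
    Nothing is stored; the relation is recomputed on every call."""
    own = edge_set(data[name])
    return sorted(other for other, ring in data.items()
                  if other != name and own & edge_set(ring))


def duplicated_vertices(data: Dict[str, List[Point]]) -> int:
    """Number of coordinate pairs stored more than once across regions."""
    seen, duplicates = set(), 0
    for ring in data.values():
        for p in ring:
            if p in seen:
                duplicates += 1
            seen.add(p)
    return duplicates


print(adjacent_regions("R1", regions))   # adjacency found by scanning all regions
print(duplicated_vertices(regions))      # redundancy caused by the shared boundary
```

    In a topological representation the shared arc would be stored once and would name both polygons, so the scan in adjacent_regions would be unnecessary; the price, as noted above, is that reconstructing one object’s coordinates requires reading other objects.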

    A.2.2 Data Management

    A GIS (Geographic Information System) needs to store both spatial data and non-spatial data. Early GIS’s were built on top of proprietary file systems. This allows the system performance to be optimized for some functions, but does not respect the important data independence principle, and leads to many drawbacks in terms of data sharing among different GIS’s, data consistency, data recovery, security, and data integrity. These problems can be solved by developing GIS’s on top of a well-developed Database Management System (DBMS). Although individual DBMS’s have their own extensions, in recent years, all general purpose DBMS’s have been developed based on well-established relational-database technology. Thus, a GIS application that is designed based on the relational database model and a standard SQL (e.g., SQL 92) can be implemented on any relational DBMS without significant modifications. Furthermore, the problems of data sharing among different GIS applications, data consistency, data recovery, security, and data integrity can be effectively solved.

    The efforts of deploying geographical information in relational DBMS’s can be classified into two approaches: the hybrid approach and the integration approach. In the hybrid approach, spatial data is stored in external files, and only non-spatial data is managed by a DBMS. The spatial data is accessed by geographic object identifier (gid) (each spatial object has a unique gid value), and all gid values are also stored (as attribute values) in the non-spatial database to connect the spatial data and the non-spatial data of each geographical object. Arc/Info (ESRI), MGE (Intergraph), and TiGRis (Intergraph) are well-known GIS’s that follow this approach. This approach provides more flexibility and greater potential for data sharing and integration. However, this approach also suffers from some drawbacks, among which the most important are: (1) the coexistence of heterogeneous data models, which implies difficulties in modeling and sharing of spatial data; and (2) the partial loss of basic DBMS functionalities, such as data recovery and query optimization regarding the spatial part of data.

    The integration approach employs a DBMS to store both spatial data and non-spatial data. This approach eliminates drawbacks of the file-based approach and the hybrid approach, since both spatial data and non-spatial data are stored in a database. Thus, data sharing, data integrity, data recovery, security, and data consistency are guaranteed by the DBMS to a great extent (assuming that the DBMS is a reliable DBMS, such as Informix, Oracle, and DB2, supporting concurrency control, recovery, and security). Unfortunately, the existing implementations are based on a specially extended ER-model with new geographic constructs, ADTs (Abstract Data Types) of spatial objects, and additional operations on these ADTs [12, 13, 14, 21, 22]. These implementations suffer from one or more of the following: (1) limited representation of complex polygons (e.g., the number of the boundary points must be less than 2000 in [15]), (2) poor performance, (3) application-dependent database schema, and (4), most importantly, since the implementations make use of the extended features of a specific commercial DBMS, data sharing and integration are rather cumbersome.

    A.2.3 The OpenGIS Standard for SQL Environment

    In May 1999, OGC (OpenGIS Consortium) released the OpenGIS Simple Features Specification for SQL [5]. The purpose of this specification is to define a standard SQL schema that supports storage, retrieval, query and update of a collection of geospatial features. In the specification, geometric features are represented based on 2D geometry with linear interpolation between vertices. For the implementation, two target SQL environments can be considered: SQL92 and SQL92 with Geometry Types. SQL92, also known as SQL2, has been widely supported by commercial DBMS products. SQL92 with Geometry Types requires the underlying DBMS to support some geometry types and SQL3 features, which allows the users to extend the type system by defining ADTs (Abstract Data Types) and operations. However, these extended SQLs are not widely supported by DBMS’s or standardized.

    A.2.4 Motivations

    We initiated this research to attack the problem of geographic information sharing. The goal of this research is to develop a Generic Relational-Geographic-Information Schema (GRGIS) and an automatic Schema Generation mechanism for GIS applications (SGGIS). A GRGIS is a high level relational abstraction of all types of geographic data (both spatial and non-spatial) of the OGC Simple Features (see Section A.2.3), and an SGGIS is a mechanism that can generate, without human intervention, a specific relational geographic database schema given a geographic dataset. Given a geographic dataset, the GRGIS and SGGIS that we propose in this appendix do not separate spatial data and non-spatial data, but store them together in a single database so as to take full advantage of the relational database technology (i.e., data sharing, data integrity and consistency, recovery, security, and optimized query processing).

    A.3 GRGIS and SGGIS

    In this section, we introduce our design of the Generic Relational-Geographic-Information Schema (GRGIS) and automatic Schema Generation mechanism for GIS applications (SGGIS). The GRGIS adopts the spaghetti representation, which makes the schema simple and efficient. Topological queries can be efficiently supported through spatial access methods and point access methods. Using efficient multidimensional access methods and the simple spaghetti representation is more efficient than using the complex topological representation in processing topological queries. The reasons for this include: (1) Topological queries can involve an arbitrary query region and one of the eight topological relations defined in [9, 10]. If the query region of a given query is not one of the data objects, keeping track of all the topological relations between data objects is not helpful in processing the query; (2) As mentioned in Section A.2.1, in the topological representation, to find the exact coordinates of the points constituting the result set of a query, we may need to read the coordinates of some other objects. This results in an increased number of storage accesses. The SGGIS is the mechanism for generating the schema of specific GIS applications.

    A.3.1 GRGIS

    Various types of geometric features are defined in the OpenGIS Simple Features Specification for SQL. Moreover, each GIS defines its own set of geometric features. Through the rest of the appendix, to minimize the redundancy in describing our GRGIS and SGGIS, we discuss only a representative subset of the geometric features, more specifically point, multipoint, polyline, and polygon.

    Generic geometric features. This section gives the definitions of generic geometric features.

    Definition 1. A point is stored as $\langle x, y \rangle$ and represents a single location in coordinate space.

    Definition 2. A multipoint is a set of points representing a conceptual object or group. There is no order among points, and the number of points varies from one set to another. A multipoint is stored as $\{\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_n, y_n \rangle\}$ where n ≥ 1 is the number of member points.

    Definition 3. A polyline is a set of one or more parts. A part is a sequence of two or more connected points (vertices) with linear interpolation between points. In a part, each consecutive pair of points defines a line segment. Parts may or may not be connected to one another. Parts may or may not intersect one another. The number of parts and the number of points in each part vary. A polyline is stored as $\{\langle \langle x_{1,1}, y_{1,1} \rangle, \ldots, \langle x_{1,n[1]}, y_{1,n[1]} \rangle \rangle, \ldots, \langle \langle x_{m,1}, y_{m,1} \rangle, \ldots, \langle x_{m,n[m]}, y_{m,n[m]} \rangle \rangle\}$ where m ≥ 1 and, for all i = 1, …, m, n[i] ≥ 2. Note that m is the number of parts and n[i] is the number of vertices constituting the i-th part.

    Definition 4. A polygon is a set of one or more exterior rings and zero, one, or more interior rings. Each interior ring defines a hole in the polygon. A ring is a sequence of three or more connected points (vertices) that form a simple and closed loop. In a ring, each consecutive pair of points defines a line segment, and the last point is connected to the first point to form a closed region. No two line segments of a ring cross. No two rings of a polygon cross; the rings of a polygon may intersect at a point but only as a tangent [5].
The order of vertices of a ring indicates which side of the ring is the interior of the polygon. While the ring representing the outer boundary of a polygon is represented by the clockwise sequence of the component points (vertices), each interior ring representing a hole is represented by the counterclockwise sequence of the vertices. A polygon is stored as $\{\langle \langle x_{1,1}, y_{1,1} \rangle, \ldots, \langle x_{1,n[1]}, y_{1,n[1]} \rangle \rangle, \ldots, \langle \langle x_{m,1}, y_{m,1} \rangle, \ldots, \langle x_{m,n[m]}, y_{m,n[m]} \rangle \rangle\}$, where $m \geq 1$ and $n[i] \geq 3$ for all $i = 1, \ldots, m$. Note that m is the number of rings and n[i] is the number of vertices constituting the i-th ring.

    The above definitions cover the geometries Point, MultiPoint, Line, MultiLineString, Polygon, and MultiPolygon defined in [5]. Our definitions are simple and have high expressive power. The GRGIS and SGGIS can be extended to accommodate other types, such as Curve and MultiCurve.

    Geographic object sets (Themes). Our model is a theme-based model. A theme is a collection of objects that have the same type of geometric feature (i.e., point, multipoint, polyline, or polygon) and the same set of non-spatial attributes (i.e., all objects in the same theme have the same spatial data structure and the same non-spatial data structure). This definition of themes coincides with that of ArcView themes [7, 18]. Although some GIS’s, such as Arc/Info, allow users to put objects that have different types of geometric features in the same theme, such objects can always be separated into different themes.

    Definition 5. A theme is a collection of geographic objects that have the same type of geometric feature and the same set of non-spatial attributes.

    For point themes, each object’s spatial (geometric) properties represent only its location. For this, two attributes, X_COORD and Y_COORD, are necessary: each element of the Cartesian product of their domains represents a unique location (in most cases, a pair of a longitude value and a latitude value). In addition, for each theme, a primary key, which consists of one or more non-spatial attributes, is required to identify objects. In a multipoint theme, each object consists of a varying number of points. Thus, the member points can be modeled as weak entities, each of which is dependent on a single multipoint object.
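The size constraints of Definitions 3 and 4 and the vertex-order convention for rings can be checked mechanically. The following is a minimal Python sketch, not part of the RGIS implementation. All function names are hypothetical. The sketch assumes a Cartesian plane with the y axis pointing up, so that a clockwise ring has a negative signed area, and it does not test that a ring is simple or that rings do not cross.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def signed_area(ring: List[Point]) -> float:
    """Shoelace formula: positive for counterclockwise rings, negative for clockwise."""
    total = 0.0
    for i, (x1, y1) in enumerate(ring):
        x2, y2 = ring[(i + 1) % len(ring)]  # the last vertex connects back to the first
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_valid_polyline(parts: List[List[Point]]) -> bool:
    """Definition 3 (size constraints only): m >= 1 parts, each with n[i] >= 2 vertices."""
    return len(parts) >= 1 and all(len(part) >= 2 for part in parts)


def is_valid_polygon(rings: List[List[Point]]) -> bool:
    """Definition 4 (size constraints only): m >= 1 rings, each with n[i] >= 3 vertices."""
    return len(rings) >= 1 and all(len(ring) >= 3 for ring in rings)


def is_exterior_ring(ring: List[Point]) -> bool:
    """Exterior rings are stored clockwise; interior rings (holes) counterclockwise."""
    return signed_area(ring) < 0
```

For example, the unit square listed as (0, 0), (0, 1), (1, 1), (1, 0) is clockwise and is therefore classified as an exterior ring, while the reversed vertex order would denote a hole.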
Our model can support analytical queries and statistical queries such as “Report the total, mean, and variance of the populations of the cities whose population is greater than 20000 in Wyoming.” To efficiently answer such queries, the system must be able to efficiently find spatial objects having a certain topological relation with respect to a given spatial query object. For example, to process the query above, all cities that are contained by the given spatial object (the region of Wyoming) must be found first. Then the population values of the cities are retrieved to compute the total, mean, and variance.

In the spaghetti representation, topological relations between spatial objects are tested on the fly. To process a topological query, spatial access methods can be used to quickly find the candidate objects whose approximations (e.g., minimum bounding rectangles) satisfy the given topological predicate with respect to the approximation of the given query region. Then, the actual spatial properties of the candidate objects are examined to discard the objects whose actual regions do not satisfy the given topological predicate with respect to the actual query region. This well-known processing technique for geographic queries is called two-phase spatial query processing [9, 10]. The most frequently used spatial approximation is the MBR (minimum bounding rectangle) [9, 10].

For this purpose, each multipoint object has its own abstraction, represented by the center of the point group and the MBR that minimally encloses all the member points. In our model, the center point and the MBR are represented by $\langle \text{CX\_COORD}, \text{CY\_COORD} \rangle$ and $\langle \text{MIN\_X}, \text{MIN\_Y}, \text{MAX\_X}, \text{MAX\_Y} \rangle$, respectively. Figures 1a and 1b show the ER diagrams representing a generic point theme and a generic multipoint theme, respectively. We modeled polyline themes and polygon themes in the same manner. Figures 2a and 2b show a generic theme of polyline objects and a generic theme of polygon objects, respectively.
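The two-phase processing described above can be illustrated for the example query (cities contained in a region, followed by an aggregate over their populations). The following Python sketch is only an illustration under stated assumptions: the function names, the triangular query region, and the five sample cities are hypothetical; the filter phase is a linear scan standing in for a spatial access method; the refinement uses a basic ray-casting test that is not robust for points lying exactly on the boundary; and the variance is computed as a population variance.

```python
from statistics import mean, pvariance
from typing import List, Tuple

Point = Tuple[float, float]
MBR = Tuple[float, float, float, float]  # (MIN_X, MIN_Y, MAX_X, MAX_Y)


def mbr_of(points: List[Point]) -> MBR:
    """Minimum bounding rectangle that minimally encloses the given points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_contains(mbr: MBR, p: Point) -> bool:
    """Filter-phase test: uses only the approximation of the query region."""
    min_x, min_y, max_x, max_y = mbr
    return min_x <= p[0] <= max_x and min_y <= p[1] <= max_y


def ring_contains(ring: List[Point], p: Point) -> bool:
    """Refinement-phase test: ray casting against the actual region boundary."""
    x, y = p
    inside = False
    for i in range(len(ring)):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % len(ring)]
        if (y1 > y) != (y2 > y):  # the edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def two_phase_contains(region: List[Point], cities: List[dict]) -> List[dict]:
    """Phase 1 keeps MBR candidates; phase 2 discards false hits."""
    region_mbr = mbr_of(region)
    candidates = [c for c in cities if mbr_contains(region_mbr, (c["x"], c["y"]))]
    return [c for c in candidates if ring_contains(region, (c["x"], c["y"]))]


# Hypothetical sample data (not taken from the paper).
region = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
cities = [
    {"name": "A", "x": 1.0, "y": 1.0, "population": 30000},
    {"name": "B", "x": 8.0, "y": 8.0, "population": 50000},    # inside the MBR, outside the region
    {"name": "C", "x": 20.0, "y": 20.0, "population": 40000},  # outside the MBR
    {"name": "D", "x": 2.0, "y": 3.0, "population": 25000},
    {"name": "E", "x": 3.0, "y": 2.0, "population": 10000},    # inside, but below the threshold
]

populations = [c["population"] for c in two_phase_contains(region, cities)
               if c["population"] > 20000]
summary = (sum(populations), mean(populations), pvariance(populations))
```

City B shows why the refinement phase is needed: it passes the MBR test but lies outside the actual region, so only the second phase removes it.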
Note that we store a set (or sequence) of points as a string (the “POINTS” multi-value attribute of Figure 1b). Since the length of the string varies, the data type varchar[n] can be used to eliminate wasted storage space. The maximum length n is determined by the DBMS (typically, 255). Since a multipoint object, a part of a polyline, or a ring of a polygon can have a large number of points (or vertices) that require more than n characters, the discriminator attribute SEQ is used to connect all string chunks constituting a single point set, a part of a polyline, or a ring of a polygon.

The ER diagrams in Figures 1 and 2 constitute four types of instance-level schemas. These generic instance-level schemas need to be specialized to create a geographic database. In each specialized ER diagram, the italic words are replaced with actual names. For example, to create a point theme “City” in which each city has a location, city name, state name, population, and size, and in which the primary key consists of city name and state name, the generic point object entity (Figure 1a) is specialized as follows: (1) the entity name “point object” is replaced by City; (2) “key attributes” is replaced by CITY_NAME and STATE_NAME; (3) “non-spatial attributes” is replaced by POPULATION and SIZE.
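The role of the SEQ discriminator can be sketched as follows. This Python fragment is an illustration only: the text encoding of points (space-separated coordinates, comma-separated points), the zero-based SEQ numbering, and all function names are assumptions, since the exact string format is not specified here.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def encode_points(points: List[Point]) -> str:
    """Assumed encoding: 'x y' pairs separated by commas."""
    return ",".join(f"{x!r} {y!r}" for x, y in points)


def decode_points(text: str) -> List[Point]:
    """Inverse of encode_points."""
    pairs = (item.split(" ") for item in text.split(","))
    return [(float(x), float(y)) for x, y in pairs]


def split_into_chunks(text: str, max_len: int) -> List[Tuple[int, str]]:
    """Cut the string into (SEQ, chunk) rows with len(chunk) <= max_len."""
    return [(seq, text[start:start + max_len])
            for seq, start in enumerate(range(0, len(text), max_len))]


def join_chunks(rows: List[Tuple[int, str]]) -> str:
    """Reassemble the original string by concatenating chunks in SEQ order."""
    return "".join(chunk for _, chunk in sorted(rows))
```

Because the chunks are concatenated in SEQ order before decoding, a chunk boundary may fall in the middle of a coordinate without losing information.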


    Figure 1. ER diagrams representing a generic theme of point objects (a) and a generic theme of multipoint objects (b): A bolded ellipse represents a set of attributes; a double rectangle represents a weak entity set; the double ellipse superscripted by n represents a member-points array of maximum length n; the mapping cardinality of “has” is many (multipoint feature) to one (multipoint object); SEQ is the discriminator. Note that the union of key attributes and non-spatial attributes constitutes the non-spatial attributes of the geographic objects.

    Figure 2. ER diagrams representing a generic theme of polyline objects (a) and a generic theme of polygon objects (b): A bolded ellipse represents a set of attributes; a double rectangle represents a weak entity set; each of the double ellipses superscripted by n represents a vertex-points sequence of maximum length n. Note that the union of key attributes and non-spatial attributes constitutes the non-spatial attributes of the geographic objects.

    Geographic databases. We view a geographic database as a set of themes. Each theme can have its own attributes, such as the name, feature type (e.g., polylines), and the area covered by the theme. In addition, considering data sharing, since different GIS’s may provide different sets of primitive data types, exactly matching each primitive data type of a GIS with one of the primitive data types of a DBMS (database management system) is not always possible. The most feasible approach to this type conversion problem is to assign a more general data type to an attribute if there is no exactly matching data type supported by the underlying DBMS. Additional information about the origin of the theme, i.e., the release of the GIS (GIS software name and version) in which the theme was created, and the original data types are stored in the database. For these reasons, we defined a meta-level entity set as shown in Figure 3: each meta-level entity describes a single theme, and each theme is described by a single meta-level entity. In our model, a geographic database is defined as follows:

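    [Diagram labels of Figures 1 and 2. Figure 1a: point object with key attributes, X_COORD, Y_COORD, and non-spatial attributes. Figure 1b: multipoint object with key attributes, MIN_X, MIN_Y, MAX_X, MAX_Y, CX_COORD, CY_COORD, and non-spatial attributes, related by “has” to multipoint feature with SEQ and POINTS (maximum length n). Figure 2a: polyline object with key attributes, MIN_X, MIN_Y, MAX_X, MAX_Y, and non-spatial attributes, related by “has” to polyline feature with PART_NO, SEQ_NO, and VERTICES (maximum length n). Figure 2b: polygon object with key attributes, MIN_X, MIN_Y, MAX_X, MAX_Y, CX_COORD, CY_COORD, and non-spatial attributes, related by “has” to polygon feature with RING_NO, SEQ_NO, and VERTICES (maximum length n).]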

    Definition 6. A geographic database is a union of a set of themes and a set of Theme entities. Theme entities are meta-level entities each of which describes one theme, and each theme is described by one Theme entity (there is a one-to-one relationship, called described_by, between themes and Theme entities).

    Figure 3. ER diagram representing meta-level entities.

    A.3.2 SGGIS: Schema Generation mechanism for GIS applications

    Creating a geographic database starts with specializing the ER diagrams introduced in Section A.3.1 by replacing the italic words with actual names. A mechanism or algorithm is needed to generate the actual schema for each specific GIS application. The SGGIS (Schema Generation mechanism for GIS applications) is designed for this purpose. The two-level structure of the GRGIS makes the schema generation process automatic, without human intervention or code modification. The instance-level schema depends on the contents of the meta-level entities. For example, the theme name gives the name of the corresponding data table (data relation), and the GEOMETRY_TYPE determines the overall structure of the data relation. Given a set of themes, the SGGIS creates a geographic database as follows:

    Create a database.

        create database database-name;

        create table Theme (
            NAME varchar[MAX_NAME_LENGTH] not null,
            ORIGIN varchar[MAX_ORIGIN_LENGTH] not null,
            GEOMETRY_TYPE integer,
            MIN_X real, MIN_Y real, MAX_X real, MAX_Y real,
            DESCRIPTION varchar[MAX_DESCRIPTION_LENGTH],
            primary key (NAME, ORIGIN)
        );

        create table Attribute_Description (
            TNAME varchar[MAX_NAME_LENGTH] not null,
            TORIGIN varchar[MAX_ORIGIN_LENGTH] not null,
            ATTRIBUTE_NAME varchar[MAX_ATTRIBUTE_NAME_LENGTH] not null,
            ORG_TYPE_NAME varchar[MAX_ORG_TYPE_NAME_LENGTH],
            MIN_WIDTH integer,
            MAX_WIDTH integer,
            PRECISION integer,
            primary key (TNAME, TORIGIN, ATTRIBUTE_NAME),
            foreign key (TNAME, TORIGIN) references Theme (NAME, ORIGIN)
        );

        create table Theme_Region (
            TNAME varchar[MAX_NAME_LENGTH] not null,
            TORIGIN varchar[MAX_ORIGIN_LENGTH] not null,
            RING_NO integer not null,
            SEQ_NO integer not null,
            VERTICES varchar[MAX_STRING_LENGTH],
            primary key (TNAME, TORIGIN, RING_NO, SEQ_NO),
            foreign key (TNAME, TORIGIN) references Theme (NAME, ORIGIN)
        );

    Create specialized themes. To create a theme, the low-endpoint (i.e., $\langle \text{MIN\_X}, \text{MIN\_Y} \rangle$) and the high-endpoint (i.e., $\langle \text{MAX\_X}, \text{MAX\_Y} \rangle$) of the MBR (minimum bounding rectangle) that minimally bounds the region covered by the theme are computed first. Then the theme is created as follows:

        insert into Theme values (…);
        for (i=0; i


        );
        // where A is the theme name, B is a list of attribute declarations with the “not null”
        // option, C is a list of attribute declarations, D is the list of attributes declared in B,
        // and E is the union of D and {PART_NO, SEQ_NO}.
        else if the theme is a polygon theme then
            create table A (
                B,
                MIN_X real, MIN_Y real, MAX_X real, MAX_Y real,
                CX_COORD real, CY_COORD real,
                C,
                primary key (D)
            );
            create table A_feature (
                B,
                RING_NO integer not null,
                SEQ_NO integer not null,
                VERTICES varchar[MAX_STRING_LENGTH],
                primary key (E),
                foreign key (D) references A (D)
            );
        // where A is the theme name, B is a list of attribute declarations with the “not null”
        // option, C is a list of attribute declarations, D is the list of attributes declared in B,
        // and E is the union of D and {RING_NO, SEQ_NO}.
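The polygon-theme case above can be read as a template instantiated from the theme name and its attribute lists. The following Python sketch illustrates that reading and is not the SGGIS implementation. The function name, the sample theme “Lake” with its attributes, and the use of an in-memory SQLite database are assumptions made only for this illustration. The sketch writes `varchar(n)` in place of the `varchar[n]` notation used above, and it interpolates identifiers without any escaping.

```python
import sqlite3
from typing import List, Tuple

Column = Tuple[str, str]  # (attribute name, SQL type)
MAX_STRING_LENGTH = 255   # assumed maximum chunk length


def polygon_theme_ddl(theme: str, key_attrs: List[Column],
                      other_attrs: List[Column]) -> List[str]:
    """Build the two create-table statements of a polygon theme (A and A_feature)."""
    b = [f"{name} {sql_type} not null" for name, sql_type in key_attrs]  # list B
    c = [f"{name} {sql_type}" for name, sql_type in other_attrs]         # list C
    d = ", ".join(name for name, _ in key_attrs)                         # list D
    e = f"{d}, RING_NO, SEQ_NO"                                          # list E
    mbr_and_center = ["MIN_X real", "MIN_Y real", "MAX_X real", "MAX_Y real",
                      "CX_COORD real", "CY_COORD real"]
    object_cols = b + mbr_and_center + c + [f"primary key ({d})"]
    feature_cols = b + [
        "RING_NO integer not null",
        "SEQ_NO integer not null",
        f"VERTICES varchar({MAX_STRING_LENGTH})",
        f"primary key ({e})",
        f"foreign key ({d}) references {theme} ({d})",
    ]
    return [
        f"create table {theme} ({', '.join(object_cols)})",
        f"create table {theme}_feature ({', '.join(feature_cols)})",
    ]


# Hypothetical polygon theme used only to exercise the generator.
conn = sqlite3.connect(":memory:")
for statement in polygon_theme_ddl("Lake", [("LAKE_NAME", "varchar(64)")],
                                   [("AREA", "real")]):
    conn.execute(statement)
```

Under the same assumptions, the other theme types would differ only in the fixed columns of the object table and in the discriminator attributes of the feature table.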

    A.4 Conclusions

    File-system-based GIS’s have several well-known problems concerning data sharing, data redundancy and inconsistency, transaction control and recovery, concurrency control, and security. The most feasible approach to these problems is to build a GIS on well-established relational database technology. In this appendix, we proposed a generic GIS schema called the Generic Relational-Geographic-Information Schema (GRGIS) that can be automatically specialized and created to accommodate any GIS dataset whose geometric features are represented based on 2D geometry with linear interpolation between vertices. The schema is based entirely on the basic relational database model and the widely used SQL standard (SQL92). This makes the proposed schema highly portable. We also proposed a technique called the automatic Schema Generation mechanism for GIS applications (SGGIS) that can automatically generate a relational database schema, given a GIS dataset. The GRGIS and SGGIS can be used to:

    (1) convert any local GIS dataset (e.g., ArcView files [7, 18]) into a relational database, (2) combine multiple GIS datasets, (3) generate an intermediate dataset to pass GIS data among heterogeneous GIS’s, (4) propagate data updates through several heterogeneous GIS’s, (5) ensure data consistency among heterogeneous GIS’s, and (6) develop a new GIS on top of a DBMS without any separate file system for spatial data.

    The proposed GRGIS and SGGIS are fully compliant with the OpenGIS Simple Features Specification for SQL developed by OGC (OpenGIS Consortium) [5] and can be further extended to support more complex features that are based on 3D geometry with non-linear interpolation between vertices. We reserve this as our future work.

    A.5 References

    1. Rigaux, P., Scholl, M., Voisard, A.: Spatial Databases with Application to GIS. Academic Press, ISBN 1-55860-588-6/G70.212.R54 (2002).

    2. Goodchild, M.F.: Geographical Data Modeling. Computers & GeoSciences, Vol. 18, No. 4. (1992) 401-408.

    3. Medeiros, C.B., Pires, F.: Database for GIS. SIGMOD RECORD, Vol. 23, No. 1. (1994).

    4. Frank, A.U.: Spatial Concepts, Geometric Data Models, and Geometric Data Structures. Computers & GeoSciences, Vol. 18, No. 4. (1992) 409-417.

    5. Open GIS Consortium: OpenGIS Simple Features Specification for SQL Revision 1.1. OpenGIS Project Document 99-049 (May 5, 1999).

    6. Ramakrishnan, R., Gehrke, J.: Database Management Systems 2nd edition. Thomas Casson, ISBN 0-0702322206 / QA76.9.D3R237 (2000).

    7. Environmental Systems Research Institute, Inc.: ESRI Shapefile Technical Description. ESRI White Technical Paper, http://www.esri.com (July, 1998).

    8. Egenhofer, M. J., Herring, J.: Categorizing binary topological relations between regions, lines and points in geographic databases. Technical Report, Department of Surveying Engineering, University of Maine, Orono, ME (1991).

    9. Papadias, D., Theodoridis, Y., Sellis, T., Egenhofer, M.: Topological Relations in the World of Minimum Bounding Rectangles: A study with R-trees. Proc. ACM SIGMOD International Conf. on Management of Data (1995) 92-103.

    10. Yu, B., Orlandic, R., Evans, M.: Simple QSF-Trees: An efficient and scalable spatial access method. Proc. ACM CIKM International Conf. on Information and Knowledge Management (1999) 5-14.

    11. Comer, D.E., Stevens, D.L.: Internetworking with TCP/IP Volume III: Client-Server Programming and Applications. Prentice Hall ISBN 7-302-03094-4/TP.1648 (1997).

    12. Hadzilacos, T., Tryfona, N.: An Extended Entity-Relation Model for Geographic Applications. SIGMOD Record, Vol. 26, No. 3. (1997).

    13. Batty, P.: Exploiting Relational Database Technology in A GIS. Computers & GeoSciences, Vol. 18., No.4. (1992) 453-462.

    14. DeWitt, D.J., Kabra, N., Luo, J., Patel, J.M., Yu, J.B.: Client-Server Paradise. Proc. the 20th VLDB Conference Santiago, Chile (1994).

    15. Morehouse, S.: The Arc/Info Geographic Information System. Computers & GeoSciences, Vol. 18., No. 4. (1992) 435-441.

    16. Hasan, W., Heytens, M., Kolovson, C., Neimat, M.-A., Potamianos, S., Schneider, D.: Papyrus GIS Demonstration. Proc. SIGMOD, Washington, DC (May, 1993).

    17. Informix Corporation: INFORMIX-ESQL/C Programmer’s Manual Version 9.12. Part No. 000-5424, INFORMIX Press (April 1999).

    18. Environmental Systems Research Institute, Inc.: Getting to Know ArcView GIS 3rd Edition. ESRI Press, Redlands, California (1999).

    19. Environmental Systems Research Institute, Inc.: Programming with Avenue. ESRI, Inc., Redlands, California (1995).

    20. Schildt, H.: Windows NT 4.0 Programming from the Ground Up. McGraw-Hill ISBN 0-07-882298-X, QA76.76.O63 S37548 (1997).

    21. Parent, C., Spaccapietra, S., Zimanyi, E.: Spatio-Temporal Conceptual Models: Data Structures + Space + Time. Proc. ACM-GIS (1999) 26-33.

    22. Tryfona, N., Hadzilacos, T.: Logical Data Modeling of SpatioTemporal Applications: Definitions and a Model. Proc. IEEE IDEAS (1998) 14-23.