
Extending an open source spatial Database with geospatial image support: An image mining perspective

Muhammad Imran

March, 2009

Extending an open source spatial Database with geospatial image support: An image mining perspective

by

Muhammad Imran

Thesis submitted to the International Institute for Geo-information Science and Earth Observation in partial fulfilment of the requirements for the degree of Master of Science in Geoinformatics.

Degree Assessment Board

Thesis advisors: Dr. Ir. R.A. (Rolf) de By, Dr. Ir. W. (Wietske) Bijker

Thesis examiners: Chair: Prof. Dr. A. Stein; External examiner: Dr. B.G.H. Gorte

INTERNATIONAL INSTITUTE FOR GEO-INFORMATION SCIENCE AND EARTH OBSERVATION

ENSCHEDE, THE NETHERLANDS

Disclaimer

This document describes work undertaken as part of a programme of study at the International Institute for Geo-information Science and Earth Observation (ITC). All views and opinions expressed therein remain the sole responsibility of the author, and do not necessarily represent those of the institute.

Abstract

The nature of vector data is relatively constant: it is revised less frequently than remotely sensed earth observation data. Remote sensing images are nowadays collected every 15 minutes from satellites such as Meteosat. In the coming years, very high spatial resolution data is expected to be available freely and frequently. Integrated GIS and remote sensing spatial analysis methods can incorporate different data sources to find attribute associations and patterns of change for knowledge discovery and change detection. GIS-based data such as vector data and DEMs are overlaid with image data, and results are taken up in a GIS for further processing and analysis. A platform is required to efficiently store, retrieve and manipulate such image data as layers, just like other GIS data layers, for hybrid GIS/RS analysis. In principle, spatial databases are the most suitable candidates for such a platform. Our work investigates the open source spatial database PostgreSQL/PostGIS (PG/PG) as such a platform, to provide a solution for image support and an overall framework for integrated remote sensing and GIS analysis. This goes well beyond mere storage and retrieval of images in spatial databases.

The requirements and available open source libraries were extensively studied to provide such image support. The TerraLib library was proposed and analysed to extend the PG database with image support. To demonstrate the application developed in this study, Meteosat Second Generation (MSG) image data for a large part of Europe was extracted from the ITC data receiver. An application program was written to construct a time-series image database for the extracted image data with the PG/PG DBMS. A mining application to detect cloud patterns from time-series image and vector data stored in the PG database was developed using the TerraLib conceptual schema. For this, an extensive study of data mining methods was carried out. A statistical data mining method based on principal components analysis was adopted to extract cloud features for the Netherlands from the time-series image data.

Using this research platform and the cloud pattern detection case application, various image mining scenarios were conducted to provide a framework for integrated image and vector data analysis on top of the DBMS technology. This framework is particularly useful for studying spatio-temporal phenomena with seasonal or long intervals, and for region-based studies where the regions on a remote sensing image are delineated by vector data.

Keywords: spatial database image support, integrated remote sensing and GIS analysis, data mining, cloud pattern detection, image analysis


Contents

Abstract
List of Figures
List of Tables
Acknowledgements

1 Introduction
  1.1 Motivation and problem statement
  1.2 Research objective
    1.2.1 Research sub-objectives
    1.2.2 Research questions
    1.2.3 Background
  1.3 Project set-up
    1.3.1 Method 1
    1.3.2 Method 2
  1.4 Thesis structure

2 Literature review
  2.1 Introduction
  2.2 Raster data model
  2.3 Data model for image storage inside PG/PG
    2.3.1 Functions for integrated vector/raster analyses
  2.4 Proposed platform for image mining

3 Data mining methods
  3.1 Introduction
  3.2 Classical data mining
    3.2.1 Statistics
    3.2.2 Database-oriented approaches to data mining
    3.2.3 Machine learning approaches for data mining
  3.3 Spatial data mining
    3.3.1 Spatial statistics
    3.3.2 Spatial database approach to data mining
  3.4 Image mining
    3.4.1 Low-level image analysis to feature extraction
    3.4.2 High-level knowledge discovery
    3.4.3 Image mining from integrated image/GIS data analysis
  3.5 Summary

4 A database application development method using TerraLib
  4.1 Introduction
  4.2 TerraLib, TerraView and PostgreSQL/PostGIS set-up for image mining
    4.2.1 TerraLib dependencies on open source third party libraries
  4.3 TerraLib application development
  4.4 Conceptual data model
    4.4.1 Data model for storage
    4.4.2 Data model for visualization
    4.4.3 Image data handling in TerraLib Database
  4.5 Summary

5 Application scenarios for image mining: Results and Discussions
  5.1 Introduction
  5.2 Image mining guided by GIS data
    5.2.1 Introduction
    5.2.2 Data preparation
    5.2.3 Method and results
    5.2.4 Discussion
  5.3 Extending TerraView for a temporal image query
    5.3.1 Introduction
    5.3.2 Method and results
  5.4 Database-oriented approaches to data mining
    5.4.1 Introduction
    5.4.2 Method and results
  5.5 Conclusion

6 Conclusions and Recommendations
  6.1 Conclusions
  6.2 Recommendations

A Source code for creating time-series image database in PG
B Source code for image mining application scenario Section 5.2
C Source code for image mining application scenario Section 5.3
D Source code for image mining application scenario Section 5.4

Bibliography

List of Figures

1.1 Metadata and Data types with some important fields
1.2 Flow diagram for providing raster support extending PostGIS with CHIP datatype
2.1 Design levels and associated design issues
2.2 The current open source software and related libraries [32]
4.1 A set-up for cloud detection image mining application
4.2 Singleton design pattern adopted for TerraLib
4.3 Factory design pattern adopted for TerraLib
4.4 Strategy design pattern adopted for TerraLib
4.5 Iterator design pattern adopted for TerraLib
4.6 TerraLib software architecture [79]
4.7 Conceptual data model related to source domain for image and vector data storage in PG, modified from [29]
5.1 Clipping of research area from MSG satellite data with vector data
5.2 An image mining process with integrated image and vector analysis with TerraLib on top of the DBMS technology
5.3 The resulting principal components for two dates at 14:00 hours
5.4 Comparison of image size on disk with size in database using compression
5.5 The views/themes populated as a result of temporal query
5.6 A sequence of steps in a mining process to generate attribute data in the PG database
5.7 Attribute data for PC images as a result of PCA algorithm applied on time-series MSG data
5.8 Time-series cloud patterns analysis for December 13, 2008
5.9 Time-series cloud patterns analysis for December 16, 2008

List of Tables

3.1 Statistical methods for data mining
3.2 Statistical methods for spatial data mining
4.1 Third-party libraries used by TerraLib for image support in PG
5.1 Image size on the disk and in the database

Acknowledgements

I would like to sincerely thank Dr. Ir. R.A. (Rolf) de By and Dr. Ir. W. (Wietske) Bijker for their support and guidance all through this work.

I would like to cordially thank Prof. Dr. A. Stein for promoting and motivating me from my first day at ITC until today.

I would like to thank Dr. Javier Morales for helping me handle Linux-specific issues.

I would like to affectionately thank all teachers for the interesting discussions during the GFM course work.

I take this opportunity to appreciate my fellow students in the GFM programme and friends in Enschede. I truly believe that you all are great and caring. To Adil, Asif, Fatemma, Gufrana, Khalil, Luis, Pramod, Salma, Swati, Tahir, Tuul: many thanks for sharing your warm friendship with me.

I would like to thank my mom and dad for their blessings.


Chapter 1

Introduction

1.1 Motivation and problem statement

Mining patterns of change and association in time-series image and other GIS data in a spatial database is a challenging and active area of research. This has become even more challenging for open source spatial databases, as these databases do not provide effective image support alongside other GIS data. An open source database platform that allows us to work with vector and image spatial data in an integrated way would be a highly interesting research vehicle. Such a DBMS can be used, for instance, in advanced applications with intensive spatial querying such as spatial data mining and spatial change detection.

Images inside a spatial database have been a topic of research since the mid-1980s; however, geospatial images have not been supported in open source spatial databases. Only very recently have open source databases come to support vector data, and spatial analysis based on that data in a declarative way, but they do not support raster image data. Remote sensing image data is space-oriented and provides pixel spectral characteristics for a phenomenon at some location. Vector data is object-oriented and provides the object characteristics at the same location in terms of shape, topology, texture, colour, and so on. Image support in a spatial database requires that images can reside in the database in an integrated way with other spatial and non-spatial data. These images can then be queried, manipulated and analysed in a seamless way.

Image storage on disk outside the database is optimal when the objective is only visualization, for instance in static web applications. Migrating image data inside the database is most realistic when the objective is to provide integrated image/vector spatial data analysis on image and vector data stored in a large spatial database. Many desktop GIS applications allow overlaying remotely sensed image data with GIS data, for instance for fast image classification. But most current GIS packages follow separate analysis procedures for separate data structures, through a time-consuming monolithic system of functions. Further, these GIS applications can handle a single image at a time and are not suitable for mining a large amount of time-series image data. Integrated spatial data analysis is bound to change the development of GIS technology completely, enabling a transition from the monolithic systems of today to the generation of spatial information appliances based on seamless, integrated, and generic functions independent of spatial data structures [1].

Geometric co-registration of a remote sensing image and GIS vector data layers provides both spectral and spatial information at a geographic location. Such a layer concept can be implemented with a spatial database to develop advanced database applications, for instance image mining and land use/land cover change detection. The common approach for mining remote sensing databases uses rules created by analysts; however, incorporating GIS information and human expert knowledge with digital image processing improves remote sensing image analysis [2].

Providing image support in an open source spatial database for integrated image analysis, whether for image mining or other advanced database applications, requires investigation in three prominent and interdependent areas:

1. Image data handling.

• Efficient image storage in and retrieval from a database. Images need to be partitioned into tiles and pyramids and stored in a database table for fast retrieval. These tiles must be accessible to clients on request. Tile retrieval should further benefit from the hierarchical storage structure, query optimization, adequate indexing, compression, and partitioning features of the database [3].

• Functions for query and administration of the image data.

2. Image data manipulation.

• Single-image operations, for instance image segmentation and classification.

• Multi-image operations, for instance overlay operations such as spectral ratios [4]. Overlay operations require overlaying one raster layer with another raster or a rasterized vector layer. Such overlay functions are required for seamless and integrated analysis in our proposed integrated image/vector image mining application development.

• Data merging and fusion functions.

3. Image data visualization.

• Visualization of overlaid image and vector data obtained from a database.
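To make the multi-image overlay operations mentioned above concrete, a cell-wise spectral ratio of two co-registered raster layers can be sketched in a few lines. This is an illustrative sketch in plain Python with hypothetical names; in the proposed set-up such an operation would run over layers retrieved from the database.

```python
def spectral_ratio(band_a, band_b, nodata=0.0):
    """Cell-wise overlay of two co-registered raster layers of equal size:
    the ratio band_a / band_b, a simple multi-image overlay operation.
    Cells where band_b is zero are set to the nodata value."""
    result = []
    for row_a, row_b in zip(band_a, band_b):
        result.append([a / b if b != 0 else nodata
                       for a, b in zip(row_a, row_b)])
    return result
```

The same cell-wise pattern generalizes to other overlay operations (differences, normalized indices) once both layers share a grid and reference system.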

1.2 Research objective

This project aims to extend the open source DBMS PostgreSQL/PostGIS with image support to conduct an integrated analysis on image and vector data stored in the spatial database, with an illustration of the proposed method to perform image data mining. There are two main objectives.


1. To study and analyse state-of-the-art image support from open source libraries and existing raster storage support in an open source database, and to propose and implement a solution that provides image storage inside the open source database in such a way that raster image data can participate in an integrated raster/vector analysis.

2. To provide a general framework for image data mining based on hybrid image and vector data inside open source spatial databases.

1.2.1 Research sub-objectives

1. The proposed solution for image support inside PostgreSQL/PostGIS will provide:

• Efficient image storage in and retrieval from the open source spatial database PostgreSQL/PostGIS, along with other spatial and non-spatial GIS data such as vector data.

• Pyramid, tiling and index support, and other parameters for efficient image retrieval performance.

• Both programming and visualization interfaces that allow inserting and retrieving images along with other datasets inside PostGIS, to perform overlay, spatio-analytic, statistical and aggregate functions over the intersection of image and vector data layers.

2. The proposed framework for spatial data mining will be based on:

• Integrated image and vector data analysis.

• Scalability for performance in the case of large repositories of image and vector data inside PostGIS.

1.2.2 Research questions

All research questions will be answered in the context of the outcome of the first objective, which will be accomplished after a complete analysis.

1. Image support from PostGIS.

• What would be the high-level conceptual data model for hybrid raster and vector datasets in the database?

• What would be the metadata and the actual raster data storage structure and format in the database?

• What would be the interface to insert images into, and retrieve them from, the spatial database along with other datasets?

• How to get and set (manipulate) raster data while traversing an image with a reference system?

• How to calculate zonal statistics over a raster image area clipped by a vector feature?


• How to provide overlay operations over raster and vector layers, such as union, intersection, etc.?

2. Image data mining.

• What would be an integrated framework for mining based on both raster and vector data inside PostgreSQL/PostGIS? This includes domain knowledge, image processing techniques, image/vector data retrieval and preparation, hypothesis building and testing, mining algorithm building and performance scalability, etc.

• What would be an interface to visualize and interpret the results?
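One of the questions above asks how to calculate zonal statistics over a raster area clipped by a vector feature. The idea can be illustrated outside any DBMS with a minimal sketch (plain Python, all names hypothetical): rasterize the polygon implicitly by testing each cell centre, then aggregate the values that fall inside.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon (list of (x, y) vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # count crossings of a horizontal ray cast to the right of (x, y)
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def zonal_mean(raster, origin, cell_size, polygon, nodata=None):
    """Mean of raster cells whose centres fall inside the polygon.

    raster    -- 2-D list of cell values (row-major, row 0 = top)
    origin    -- (x, y) ground coordinates of the raster's upper-left corner
    cell_size -- ground width/height of one square cell
    """
    x0, y0 = origin
    total, count = 0.0, 0
    for i, row in enumerate(raster):
        for j, value in enumerate(row):
            if value == nodata:
                continue
            # centre of cell (i, j) in ground coordinates (y decreases with row)
            cx = x0 + (j + 0.5) * cell_size
            cy = y0 - (i + 0.5) * cell_size
            if point_in_polygon(cx, cy, polygon):
                total += value
                count += 1
    return total / count if count else None
```

A database implementation would push this aggregation into the server and use the tile bounding boxes to skip tiles that do not intersect the polygon at all.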

1.2.3 Background

Mattikalli [5] presented a methodology for integrating remotely-sensed raster data with vector data. The approach was based on the mathematical concepts of sets and groups, and was successfully implemented for the analysis of historical land use change from 1931 to 1989 in the River Gken catchment, U.K. Remotely-sensed images were converted into georeferenced vector layers. These resulting layers were then employed with other vector data in a GIS to perform land-use change analysis. It was shown that this approach can be efficiently adopted for operational use, incorporating products derived from both coarse- and fine-resolution remotely-sensed satellite images once these are integrated with the vector-based GIS.

In recent years, the incorporation of multi-source data (e.g. aerial photographs, TM, SPOT and previous thematic maps) has become an important method for land-use and land-cover (LULC) change detection, especially when the change detection involves long time intervals associated with different data sources, formats and accuracies, or multi-scale land-cover change analysis [6]. Image data is an important component of any large-scale spatial database, since RS (land-use classification, mapping, and so on) and GIS techniques are used to interpret these databases [7]. Weng [10] used the integration of remote sensing, GIS and stochastic modelling to detect land use change in the Zhujiang Delta of China, and indicated that such integration was an effective approach for analyzing the direction, rate and spatial pattern of land-use change. Automated detection of change and anomalies in existing databases using image information can form an essential tool to support quality control and maintenance of spatial information [9].

Yang and Lo [8] used an unsupervised classification approach, GIS-based image spatial reclassification, and post-classification comparison with GIS overlay to map the spatial dynamics of urban land-use/land-cover change in the Atlanta, Georgia metropolitan area. GIS-based image analyses have shown many advantages over traditional change detection methods in multi-source data analyses [11]. More research focussing on the integration of GIS and remote sensing techniques is necessary for better implementation of change detection analysis, and for the discovery of patterns from such analysis in image mining processes.


The construction of spatial databases that handle raster data types has been studied in the database literature, and the main approach taken has been to develop specialized data servers, as in the case of PARADISE [12] and RASDAMAN. The chief advantage of this approach is the potential for performance improvements, especially in the case of large image databases. Its main drawback is the need for a specialized, non-standard server, which would greatly increase the management needs for most GIS applications. The approach taken by TerraLib eliminates such drawbacks and includes raster data management in object-relational DBMSs. By means of adequate indexing, compression and retrieval techniques, satisfactory performance can be achieved using a standard DBMS, even for very large satellite images [5].

GeoMiner [14], developed at Simon Fraser University, is a spatial data mining system prototype able to characterize spatial data using rules, compare, associate, classify and group datasets, analyze patterns, and perform data mining at different levels. ADaM [15], a NASA project with the University of Alabama in Huntsville, is a toolset to mine images and scientific data. It performs pattern recognition, image processing, optimization, and association rule mining, among other operations [16].

1.3 Project set-up

We will investigate two approaches to providing the required indexed and optimized image support, along with other data sets, inside open source PG/PG for the proposed integrated image/vector data mining perspective.

1.3.1 Method 1

The first approach is to use the TerraLib library to extend the PG/PG DBMS to provide image support. TerraLib is an open source GIS library that builds its conceptual model and related metadata tables to handle raster (image) and vector data in the PG database. The TerraLib library classes are built over these metadata tables in the database. The data and metadata tables in the database are managed by TerraLib when an application program executes its operations.

The TerraLib library uses the built-in geometry types of PostGIS for vector data. For raster data it uses the PG BLOB (Binary Large Object) type, as PG does not provide any complex data type for raster data. The TerraLib library relies on the PG DBMS for raster/vector storage, indexes, query optimization, persistence, multi-user access, and so on.

There are two options for providing the required raster data support in the PG/PG database for an integrated image/vector analysis using the TerraLib library:

• Using the TerraLib programming interface to develop advanced database applications and prototypes through raster/vector integrated analysis. The application developer uses the TerraLib classes built over its conceptual schema in the PG database for database application development. TerraView is an open source interface product built on top of the TerraLib library that aims to provide a visualization interface for the rapid development of integrated GIS applications based on both raster and vector data in the PG/PG database.

The first step in this case would be a complete understanding of the TerraLib conceptual model, classes, and interfaces, in order to develop and implement a data mining application on remote sensing image and GIS vector data stored in PG/PG.

• TerraLib and PostGIS can also be integrated at a lower level, as both are OGC-compliant libraries written in C++. Any effort to bring the TerraLib intermediate-level code down to a lower level in the PG DBMS for complete integration will require architectural revisions to TerraLib, for instance revising its projection class to work with the spatial reference system metadata table of PostGIS. Complete integration will require creating an ADT (Abstract Data Type) for raster data in PG and extending it with the TerraLib library. This ADT can be further extended with TerraLib-provided functions so that they can be executed from the PG SQL interface.

The first step in this case would be the complete integration of the TerraLib library and the PG/PG DBMS prior to developing an image mining application based on remote sensing image and vector data in the PG database.

1.3.2 Method 2

The second approach is to extend the PG DBMS with a complex data type and operators for efficient image handling in the database. The first step in this case is the design and development of that complex data type and its functions, prior to developing an image mining application based on remote sensing image and vector data in the PG database.

PostGIS provides a simple abstract data type for raster data called CHIP. This data type does not provide efficient image storage options such as indexing and tiling, image manipulation functions, or overlay functions. It can be improved into a complex type that provides the required image storage, manipulation, and analytic functions for an integrated image and vector data analysis.

To provide such a complex type for raster data, a PGRaster metadata type holding a GEOMETRY (BOX3D) for the whole image extent can be defined. As shown in Figure 1.1, the defined metadata type has a one-to-many relation with the CHIP type. An image is divided into blocks or tiles stored as CHIP data. Each tile also has a unique bounding box or extent. The CHIP data type holds the header (metadata) and the actual raster data for an image tile or block. This CHIP data type will then be extended with user-defined functions for image data manipulation from the PG SQL interface.
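The tiling scheme described above, an image split into blocks each carrying its own bounding box, can be sketched as follows. The function and field names are hypothetical, and the geometry convention (upper-left origin, north-up, square pixels) is an assumption; it is not the CHIP storage format itself.

```python
def tile_extents(width, height, tile_size, origin, cell_size):
    """Split a width x height image into tile_size x tile_size blocks and
    return, for each tile, its pixel window and ground bounding box.

    origin    -- (x, y) ground coordinates of the image's upper-left corner
    cell_size -- ground size of one square pixel
    Edge tiles may be smaller than tile_size.
    """
    x0, y0 = origin
    tiles = []
    for row in range(0, height, tile_size):
        for col in range(0, width, tile_size):
            w = min(tile_size, width - col)
            h = min(tile_size, height - row)
            # ground bounding box of this tile: (xmin, ymin, xmax, ymax)
            bbox = (x0 + col * cell_size,
                    y0 - (row + h) * cell_size,
                    x0 + (col + w) * cell_size,
                    y0 - row * cell_size)
            tiles.append({"window": (col, row, w, h), "bbox": bbox})
    return tiles
```

Storing these per-tile bounding boxes in a metadata table is what lets a spatial index prune tiles that do not intersect a query region.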

PGCHIP is an open source library that provides an interface between a client application and the PG CHIP data type [17]. Based on the improved CHIP data type from the previous step, PGCHIP can also be improved into a database loader/dumper utility for raster data import and export.


Figure 1.1: Metadata and Data types with some important fields

Figure 1.2: Flow diagram for providing raster support extending PostGIS with CHIP datatype


StarSpan is an open source library for integrated raster/vector analysis. This library can also be extended with the CHIP data type to provide integrated raster/vector analysis functions. The complete workflow for the proposed method is shown in Figure 1.2.

In the next chapter we will investigate both approaches, keeping data mining and other performance issues in focus. We will adopt the better approach based on scientific reasoning and research constraints.

1.4 Thesis structure

This thesis comprises six chapters.

• Chapter 1 is the current chapter. It introduces the problem statement, research objectives, and research questions, and outlines the scope of this study and the general approaches taken to carry it out.

• Chapter 2 contains an extensive review of the individual efforts that have been made to provide image support in the open source PostgreSQL/PostGIS database. The requirements for providing image support in a DBMS are identified, and based on these requirements one of the proposed methods is selected.

• Chapter 3 surveys various methods for data mining. Data mining is further reviewed for its extension to spatial and image concepts, for spatial and image data mining respectively. These methods are referenced when developing image mining scenarios with the provided image support in the PG database.

• Chapter 4 elaborates the procedure to build and investigate a set-up on the TerraLib and PostgreSQL DBMS technologies, as a method for building advanced database applications such as data mining and change detection with integrated image and GIS data analysis. This chapter also provides the understanding needed to work with the TerraLib library, introducing the TerraLib classes, conceptual schema, data models, and image handling.

• Chapter 5 presents the various statistical and database-oriented mining techniques that were applied to evaluate the capabilities of the proposed method for integrated analysis based on the TerraLib and PostgreSQL DBMS technologies. Data processing steps are provided, with results and discussion.

• Chapter 6 finally presents conclusions and recommendations drawn fromthe study.


Chapter 2

Literature review

2.1 Introduction

An extensive literature review on providing image support in the open source database PostgreSQL/PostGIS (PG/PG) is carried out. The objective in providing image support in a spatial database is to use image and GIS data for integrated image/vector analysis in an image mining process. A spatial database is the most suitable candidate for this kind of integrated analysis. People have been trying to provide image data support in PG/PG since 1998. This chapter investigates all such efforts and proposes a suitable method that will enable us to fulfil the objectives. Section 2.2 explains general raster/image concepts, the OGC specification for raster implementations, and DBMS considerations for such implementations. Section 2.3 describes the design issues in providing image support in PG/PG and the work done to address these design issues. It also discusses the functions required for integrated image/vector data analysis and possible solutions for providing these functions in PG. Section 2.4 describes the proposed method and all the technologies used to provide image support in PG. The proposed set-up will serve as a research platform to develop advanced database applications through integrated image/vector analysis.

2.2 Raster data model

A raster data model divides space into a grid of cells. The position of a cell in the grid is described by cell coordinates (i, j), where i is the row number and j is the column number. The cell width in ground units represents the resolution of the raster data. The cell value has a data type and size, called the cell depth. The cell data type can be a primitive such as an integer or a real number. It can also be a code number referencing an associated table, called a look-up table or value attribute table. The value recorded for a cell can be a discrete value such as land use, a continuous value such as rainfall, or a null value if no data is available at that particular location.

Georeferencing transforms the cell coordinate system to a ground coordinatesystem for the raster data. Different attributes are stored as separate layers.These layers can for instance be the bands of multi-spectral image or thematic

9

2.2. Raster data model

layers of land use data. Each cell value in a band or layer can be further ex-tended with types for example a color map type (to map a thematic value in cellto RGB), or with a grayscale type.

An image is a specialized case of raster data, and a cell is in that case called a pixel. The upper left corner cell value is the reference value and starting point of a block. These starting cell values are registered as metadata and used to join the blocks when required. For a multi-band image, the axis along the bands is called the band dimension. For a time-series multilayer image (where each layer has a different date or timestamp), the axis along the layers is called the temporal dimension. A layer is a logical concept for storing single or multiple bands of a raster in a database. A block is the smallest logical unit for data storage on disk. When data for multiple bands need to be stored in a single block, an interleaving technique is commonly adopted to arrange the data of each band. The image is usually divided into multiple tiles, and different pyramid levels are built for fast retrieval [18].
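The interleaving choice determines how multi-band cell values are ordered inside a block. A minimal sketch of the three common schemes mentioned again later in this chapter (BSQ, BIL, BIP), using a tiny two-band 2x2 raster:

```python
# bands[b][i][j]: value of band b at row i, column j (2 bands, 2x2 raster)
bands = [
    [[1, 2],
     [3, 4]],      # band 0
    [[5, 6],
     [7, 8]],      # band 1
]
n_bands, n_rows, n_cols = 2, 2, 2

# BSQ (band sequential): all of band 0, then all of band 1
bsq = [bands[b][i][j] for b in range(n_bands)
       for i in range(n_rows) for j in range(n_cols)]

# BIL (band interleaved by line): for each row, one line per band
bil = [bands[b][i][j] for i in range(n_rows)
       for b in range(n_bands) for j in range(n_cols)]

# BIP (band interleaved by pixel): all band values of a pixel together
bip = [bands[b][i][j] for i in range(n_rows)
       for j in range(n_cols) for b in range(n_bands)]

print(bsq)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(bil)  # [1, 2, 5, 6, 3, 4, 7, 8]
print(bip)  # [1, 5, 2, 6, 3, 7, 4, 8]
```

BSQ favours whole-band access, BIP favours per-pixel access across bands, and BIL is a compromise; the stored bytes differ only in ordering.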

OGC defines the raster type as a coverage, which refers to any data representation that assigns values directly to spatial positions. OGC provides general implementation specifications for querying such data; however, it does not provide specifications for a strict storage model or representation. According to OGC, an essential property of a coverage is the ability to generate a value for any point within its domain. How the raster is implemented internally is not a concern: it can be represented by a set of polygons that exhaustively tile a plane, a grid of values, a mathematical function, or a combination of these, as long as the coverage can return a value for any atomic cell. The OGC specifications for querying raster data allow access to a geospatial coverage for the values or properties of geographic locations. Set and get methods are provided to set and get a raster cell value through enumerations, like a cursor in an SQL programming interface. A raster coverage is usually provided with a method for interpolating values at spatial positions between the points or within a cell [19].
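The interpolation method a coverage exposes can be illustrated with bilinear interpolation between the four surrounding cell-centre values. This sketch assumes unit cell spacing and is not tied to any particular OGC API:

```python
import math

def bilinear(grid, x, y):
    """Interpolate a value at fractional position (x, y) in a grid of
    cell-centre values grid[row][col], assuming unit cell spacing."""
    j0, i0 = int(math.floor(x)), int(math.floor(y))
    dx, dy = x - j0, y - i0
    v00 = grid[i0][j0]          # the four surrounding cell centres
    v01 = grid[i0][j0 + 1]
    v10 = grid[i0 + 1][j0]
    v11 = grid[i0 + 1][j0 + 1]
    top = v00 * (1 - dx) + v01 * dx
    bottom = v10 * (1 - dx) + v11 * dx
    return top * (1 - dy) + bottom * dy

values = [[0.0, 10.0],
          [20.0, 30.0]]
print(bilinear(values, 0.5, 0.5))   # 15.0, the average of the four corners
print(bilinear(values, 0.0, 0.0))   # 0.0, an exact cell centre
```

This is how a grid coverage can "generate a value for any point within its domain" even though only discrete cell values are stored.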

Images in space are geo-referenced, which distinguishes OGC raster/coverage data sets from SQL MM Part 5: Still Image. An image geo-referencing process associates a location in an image with a geographic or local coordinate system. The spatial database associates a referencing system dynamically while performing image operations. A spatial reference system (SRS) is usually implemented through transformations such as the six-parameter affine transformation. The parameters of the transformations are stored in metadata tables [20].
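The six-parameter affine transformation mentioned above maps cell coordinates to ground coordinates. The sketch below uses the parameter layout of a GDAL-style geotransform (origin, cell sizes, rotation terms); the coordinate values are invented for illustration:

```python
def cell_to_ground(i, j, gt):
    """Map cell (row i, column j) to ground (x, y) with a six-parameter
    affine geotransform gt = (x0, dx, rx, y0, ry, dy)."""
    x0, dx, rx, y0, ry, dy = gt
    x = x0 + j * dx + i * rx
    y = y0 + j * ry + i * dy
    return x, y

# Origin (500000, 4600000), 30 m cells, north-up image: no rotation terms,
# negative dy because row numbers increase downwards.
gt = (500000.0, 30.0, 0.0, 4600000.0, 0.0, -30.0)

print(cell_to_ground(0, 0, gt))    # (500000.0, 4600000.0), the upper-left corner
print(cell_to_ground(10, 5, gt))   # (500150.0, 4599700.0)
```

These six numbers are exactly the kind of parameters a spatial database keeps in its metadata tables and applies on the fly during image operations.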

For raster storage, relational databases support variable-length numeric data types with varying precision, and variable-length character data types. These data types have limited capacity compared to image storage requirements. Most relational databases offer support for storing such image data as a Binary Large Object (BLOB), but they do not offer fine-grained access control (e.g., down to pixel level) for such BLOBs. The disadvantage is that a single-bit change locks the whole BLOB, and a single-bit read loads the whole BLOB into the buffer cache.

Chapter 2. Literature review

Figure 2.1: Design levels and associated design issues

To gain more access control over storage such as BLOBs, modern extensible database technology supports user-defined ADTs (Abstract Data Types) and user-defined functions callable from SQL. "The key feature of an OR-DBMS is that it supports a version of SQL, SQL3/SQL99, that provides the notion of user-defined types (as in Java or C++)" [21]. These extensions provide a powerful mechanism for:

1. Variable-length storage, facilitating efficient use of disk space. Tiling and image pyramids are used for fast retrieval through fine-grained splitting of the image.

2. Declarative support through SQL API interfaces.

3. Developers can build their own domain-specific extensions through the DBMS-provided API interfaces.

Raster data is stored as a field whose type depends on the DBMS in which the data is stored. In DBMSs with a spatial extension, the field type is the type provided by the extension; otherwise it is a BLOB [22].

To provide raster/coverage support in the PostgreSQL/PostGIS DBMS, decisions on design issues are required at the three levels shown in Figure 2.1. We will discuss the work done so far at these three design levels and select the most suitable method for providing image support with PG/PG for integrated image/vector analysis.

2.3 Data model for image storage inside PG/PG

For image support in a spatial database, DBMS developers need to design a high-level conceptual data model for image data and metadata based on complex data types. These complex data types need to be extended with functions to access and manipulate the image data through an SQL interface.

In an ORDBMS, non-structured data can be organised using WKB (well-known binary) data types. A straightforward approach is to store the raster data in a WKT (well-known text) or WKB column, with the associated metadata stored in other columns. For performance, an image can be stored in multiple rows so that each row manages a single tile. Other mechanisms such as data compression and pyramid structures can also be used to improve efficiency and performance [23]. The bounding boxes of these tiles are stored as geometry and are used to build the GiST index. Usually, a unique identifier associated with each tile is used as the primary key for queries and for caching in a hierarchical storage structure.
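The bounding-box bookkeeping for tiles can be sketched as follows; the function name and parameters are invented for illustration and are not taken from PostGIS:

```python
def tile_bboxes(x_min, y_max, cols, rows, cell, tile):
    """Split a north-up image (origin at the upper-left corner, square cells
    of size `cell`) into tiles of `tile` x `tile` cells and return their
    ground bounding boxes as (xmin, ymin, xmax, ymax) tuples, e.g. for
    storage as geometry next to each tile's data."""
    boxes = []
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            j1 = min(j0 + tile, cols)      # clip the last tile in each row
            i1 = min(i0 + tile, rows)
            boxes.append((x_min + j0 * cell, y_max - i1 * cell,
                          x_min + j1 * cell, y_max - i0 * cell))
    return boxes

# 100 x 100 cells of 10 m, split into 64-cell tiles -> 2 x 2 = 4 tiles
boxes = tile_bboxes(0.0, 1000.0, cols=100, rows=100, cell=10.0, tile=64)
print(len(boxes))   # 4
print(boxes[0])     # (0.0, 360.0, 640.0, 1000.0)
```

Each box is what a GiST index would operate on when a spatial query selects only the tiles overlapping a search window.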

PostgreSQL/PostGIS currently comes with two raster storage models: TOAST (The Oversized-Attribute Storage Technique) and BLOB (Binary Large Object).

A table with a column that can hold large entries (> 8 KB) has an associated TOAST table, whose OID is stored as a locator. The locator links a row in a database table to its out-of-line TOASTed field values stored in the TOAST table. To support TOAST, a data type must have a variable-length (varlena) representation, like bytea. The parameter TOAST_MAX_CHUNK_SIZE, in bytes, is used to divide an out-of-line value, for example an image, into chunks. Each chunk is stored as a separate row in the TOAST table [25].
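The chunking mechanism can be mimicked in a few lines. The chunk size below is illustrative only; the actual TOAST_MAX_CHUNK_SIZE is derived from the database page size:

```python
# Split an out-of-line value into fixed-size chunks, as TOAST does with
# TOAST_MAX_CHUNK_SIZE; each (seq, chunk) pair would be one TOAST-table row.
def to_chunks(value: bytes, chunk_size: int):
    return [(seq, value[off:off + chunk_size])
            for seq, off in enumerate(range(0, len(value), chunk_size))]

def from_chunks(rows):
    # Reassemble the original value from (sequence number, chunk) rows.
    return b"".join(chunk for _, chunk in sorted(rows))

image = bytes(range(256)) * 40          # 10240 bytes of "pixel" data
rows = to_chunks(image, chunk_size=2000)
print(len(rows))                         # 6 chunk rows
assert from_chunks(rows) == image        # lossless round trip
```

The point made in the surrounding text follows directly from this layout: the database can fetch individual chunk rows, but it offers no iterator below the chunk, so touching one pixel still means loading its whole chunk.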

The advantage of this architecture is that it allows a simple data model for raster storage, with a little extra effort such as storing the bounding box of a tile as geometry for GiST index support. We can then store metadata in a metadata table and the blocks/tiles of an image inside a TOAST table. The size of a block/tile can be controlled by TOAST_MAX_CHUNK_SIZE. This can be very useful when:

1. The objective is just to locate and display an image, with selection criteria on attributes and not on another image.

2. The raster image data is small enough to insert through an SQL INSERT statement, and applications frequently access attributes other than the chunks of the image.

The disadvantage is that image manipulations need to read and write at pixel level while traversing an image inside the database, but PostgreSQL manages TOASTed data itself without providing a cursor handler or an iterator to traverse an image chunk. The whole chunk is loaded into the database memory structures at once, even if only a single pixel needs to be read or written.

The PostGIS PGRaster SQL interface "shall" requirements document states: "For a bytea/TOAST storage model, PGRaster data shall be inserted through an SQL insert statement and shall be retrieved through an SQL select statement." More efficient insertion can be obtained by creating a prepared insert statement, using a low-level interface function to provide the binary data separately from the SQL text. In the PostgreSQL C API, PQexecPrepared and PQexecParams provide this low-level functionality [24].

A second solution, which most ORDBMSs provide, is the BLOB or smart BLOB. The application receives a handle for accessing a BLOB through the well-known file system interface: open, close, read, write and seek. This allows fine-grained access to the BLOB type: the ability to seek to a specific location, to read or write changes through a programming interface, and to write functions that extend SQL.

The PostgreSQL client- and server-side programming library (libpq) provides manipulation of large objects. The library contains the LO functions; SQL is extended to call these functions within SQL commands, and large object manipulation takes place using the lo functions within an SQL transaction block. The difference between TOAST and LOB handling is that TOASTed data is automatically managed by PostgreSQL, while large objects can be randomly modified at the finest atomic level. A disadvantage of large objects is that a trigger needs to be defined to delete the large object when its referencing row is deleted [25].

A PostGIS abstract data type (ADT) with associated functions, called CHIP, stores a raster header and the actual raster data in-line. CHIP provides some basic functions for manipulation, and an external programme can write into or read from CHIP data by using these functions. CHIP does not, however, provide functions for the handling and manipulation of images.

PGCHIP is an open source driver that uses the GDAL library for raster data read and write operations with the CHIP data type [26]. The PGCHIP external programme supports both the OGC well-known text representation of a spatial reference system (SRTEXT) and the Proj4 text representation (PROJ4TEXT), to provide coordinate transformation capabilities. PGCHIP is under development and also does not provide image manipulation or image/vector overlay operations.

The Oracle physical data model for images consists of two object types: a GeoRaster type for the metadata of a whole image (the image header), and a raster object type holding each block/tile of actual image data as a BLOB in out-of-line fashion. The footprint of each block is also extracted and stored as a raster type attribute to build spatial indexes. The GeoRaster object type is further extended with an XML object type for metadata storage; only the image footprint is stored as a field of the GeoRaster type outside the XML type. An image can be inserted through the SQL interface with an insert statement, or with a loader utility from the command prompt [27].

Along similar lines as Oracle GeoRaster, Xing Lin developed a raster model for PostGIS named PGRaster [20]. Metadata is stored as fields of a PGRASTER_METADATA type with the spatial extent (BBOX) and SRID. The raster value type refers to the scale of measurement, which can be nominal, ordinal, interval or ratio. The model coordinates have the same unit as the specified SRID. The actual image data is stored in a PGRASTER object type as blocks. A geotiff2pgraster loader creates the blocks, the GiST index, and the pyramid levels while importing an image into the database. Some basic SQL functions for handling the parameters of the data model are provided; however, many functions for image manipulation are still under development. The source code is distributed under the terms of the GNU General Public Licence by Refractions Research Inc. There is a debate over whether Xing Lin's PGRaster source code and design violate patents held by Oracle Corporation.

INPE's (National Institute for Space Research) TerraLib is an open source software library that extends object-relational DBMS technology to support spatio-temporal models, remote sensing image databases, and integrated spatial analyses [28]. It provides support for the PostgreSQL/PostGIS, MySQL, MS, and Oracle DBMSs. TerraLib provides interfaces for C++, Java, COM and OGIS web service environments for GIS application development. TerraLib creates its own conceptual model by opening the connection through an application programming interface (API) driver. It provides support for handling large image data sets, offering indexing, tiling, pyramiding, and compression techniques. The TerraLib raster storage model follows the OpenGIS implementation specifications for grid coverages, and also fulfils the PostgreSQL "shall" requirements for raster storage. TerraLib vector features are fully OGC compliant.

Figure 2.2: The current open source software and related libraries [32]

TerraLib raster data structures include:

1. Raster: a multi-dimensional raster data structure (used for images and grids).

2. Cell: a single cell, used for building cell spaces.

Cell spaces can be seen as a generalized raster structure in which each cell stores more than one attribute value, or as a set of polygons that do not intersect each other. TerraLib handles spatio-temporal data types (events, moving objects, cell spaces, modifiable objects) and allows spatial, temporal and attribute queries on the database. TerraLib supports dynamic modelling in generalized cell spaces and spatial data mining, and has a direct runtime link with the R programming language for statistical analysis of spatial data [29, 30].

2.3.1 Functions for integrated vector/raster analyses

The C-based open source projects are built upon reused libraries, which form the basis for integration and interaction between different formats of spatial data sets [32].

GDAL is an open source translator library for raster (GDAL) and vector (OGR) geospatial data formats [33]. As shown in Figure 2.2, almost all open source software packages use GDAL/OGR for manipulating raster and vector data, and conceptually show a tendency to provide functions for integrated raster/vector analyses. The OGR library can read vector data sets and transform them into feature layers. An OGR layer can then be rasterized with the GDAL libraries to intersect any geometry feature in a vector data source with the GDAL raster layer. Rasterization is the process of burning vector polygons into a raster, and vectorization is the process of converting a raster layer into vector polygons. These processes are often carried out to let raster and vector data interact in hybrid raster/vector analysis.
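Rasterization can be sketched with a point-in-polygon test per cell centre (even-odd ray casting). GDAL's actual rasterizer is scanline-based and far more efficient, so this only illustrates the concept:

```python
def point_in_polygon(x, y, poly):
    """Even-odd ray-casting test; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for k in range(n):
        (x1, y1), (x2, y2) = poly[k], poly[(k + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge crosses the scan row
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def rasterize(poly, cols, rows, cell):
    """Burn a polygon into a cols x rows grid of `cell`-sized cells with
    origin (0, 0); cells whose centre falls inside get the value 1."""
    return [[1 if point_in_polygon((j + 0.5) * cell, (i + 0.5) * cell, poly)
             else 0 for j in range(cols)] for i in range(rows)]

square = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
mask = rasterize(square, cols=4, rows=4, cell=1.0)
for row in mask:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 1, 0]
# [0, 1, 1, 0]
# [0, 0, 0, 0]
```

The resulting 0/1 mask is exactly the intermediate product that lets a vector geometry select pixels from a raster layer.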

On this principle, the Starspan utility program was designed to fuse raster and vector layers for spatial analysis. A basic operation performed by Starspan is the extraction of spectral data from raster files whose pixels are geometrically contained in the geometry features (points, lines, polygons) of the vector data [34]. Starspan is open source, written in C++ using the GDAL/OGR/GEOS libraries, and works with all the formats supported by the underlying libraries. Various algorithms are used, according to the type of geometry, to find the pixels in a raster R that are contained in the given geometry features of a vector V [35].

The TerraLib libraries also provide functions to perform zonal operations over a region of a raster layer clipped by vector features. A zonal operation calculates a set of statistical measures (for example sum, mean, variance, and so on) over the part of the raster layer that lies inside the polygon representing the region of interest. The result is returned in a data structure provided by TerraLib.
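Given a raster and a 0/1 mask of the region of interest, the zonal operation reduces to collecting the masked values and computing the statistics over them. A sketch (population variance is used; TerraLib's own result structure is not reproduced here):

```python
def zonal_stats(raster, mask):
    """Count, sum, mean and (population) variance of the raster cells
    where the region-of-interest mask is 1."""
    values = [raster[i][j]
              for i in range(len(raster))
              for j in range(len(raster[0])) if mask[i][j] == 1]
    n = len(values)
    total = sum(values)
    mean = total / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"count": n, "sum": total, "mean": mean, "variance": var}

raster = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
mask = [[0, 1, 0],      # region of interest: the middle "cross"
        [1, 1, 1],
        [0, 1, 0]]
print(zonal_stats(raster, mask))
# {'count': 5, 'sum': 25, 'mean': 5.0, 'variance': 4.0}
```

In a real workflow the mask would come from rasterizing the clipping polygon, so zonal statistics are the natural follow-up to the rasterization step described above.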

2.4 Proposed platform for image mining

The first proposed method is adopted: to use the open source TerraLib/TerraView libraries to provide image support in the PG/PG DBMS. This provides a research platform for developing advanced applications, such as image mining, through integrated image/vector analysis. The TerraLib conceptual schema will be created in PG for image and vector data storage and retrieval in the image mining application. The TerraLib library will be used for spatial data analysis based on hybrid raster and vector operations, and for image processing. The TerraLib C++ programming interface will be used for algorithm development in image mining. The TerraView visualization interface will be used for visual interpretation, training on the image data, and analysing the results in our image mining application development. The selection of this method is based on a comprehensive study of the existing libraries from previous work, and on the following requirements:

1. Efficient Image Support

Advanced database applications such as image mining require image support from a DBMS with high image retrieval performance. TerraLib uses indexing and compression of images for storage inside the database. Multi-resolution pyramids store the raster data at various sizes and degrees of resolution. Each resolution level is divided into tiles. A tile has a unique bounding box and a unique spatial resolution level, which are used to index it. TerraLib stores multi-resolution pyramids with up to seven levels of resolution. This avoids unnecessary data access on the client side; however, it requires extra storage. To compensate for the extra storage requirement, TerraLib applies lossless compression to the individual tiles. When a section of the data is retrieved, only the relevant tiles are accessed and decompressed [28]. TerraLib also supports the BSQ, BIL, and BIP interleaving methods. Raster image data can be shared in different formats such as GeoTIFF, TIFF, JPEG, RAW, ASCII-Grid and ASCII Spring.
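A resolution pyramid can be sketched by repeatedly averaging 2x2 blocks of cells. TerraLib's actual pyramid construction, tiling and compression are more involved; the code below only illustrates the multi-resolution idea:

```python
def downsample(level):
    """Average each 2 x 2 block of cells into one cell of the next level."""
    rows, cols = len(level), len(level[0])
    return [[(level[i][j] + level[i][j + 1] +
              level[i + 1][j] + level[i + 1][j + 1]) / 4.0
             for j in range(0, cols, 2)]
            for i in range(0, rows, 2)]

def build_pyramid(base, levels):
    """Base image plus (levels - 1) coarser overviews."""
    pyramid = [base]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

base = [[float(4 * i + j) for j in range(4)] for i in range(4)]
pyramid = build_pyramid(base, levels=3)
print([len(level) for level in pyramid])   # [4, 2, 1] rows per level
print(pyramid[2])                           # [[7.5]], the overall mean
```

A viewer zoomed out far enough reads only the small top levels, which is why pyramids trade extra storage for much less data access.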

2. Hybrid Image/Vector Data Analysis

TerraLib/TerraView manages image data in a PG database and allows its visualization and manipulation together with vector data. The overlay functions provided by TerraLib can operate on image and vector layers. A vector polygon and the image pixels inside that polygon will point to the same geographical location, provided the corresponding layers are projected in the same reference system. An operator can easily extract the pixels coinciding with a specific polygon area and perform statistical analysis on them. TerraLib also offers rich overlay operations, for instance difference, union, and intersection. A decoder class can perform raster conversion between 52 raster formats.

3. Statistical Analysis

Any platform for image mining that has no spatial statistical capabilities is insufficient. TerraLib has a basic spatial statistics package, including local and global autocorrelation indexes, non-parametric kernel estimators, and regionalization methods [36]. Additionally, TerraLib provides a direct link with the R programming language through the aRT package [37]. R is an open source language and environment for statistical computing and graphics. Packages in R relevant to GIS include geoR for geostatistics, splancs for the analysis of point processes, and sp for general spatial analysis. TerraLib developers can quickly develop wrappers for R without knowing the R internals, and users of TerraLib applications can use these wrappers in spatial statistical analysis without knowing the R syntax. Using the aRT package inside a TerraLib application, an operator can query spatial data stored in the database, conduct spatial statistical analysis, and store the results back in the database. These results can then be displayed with TerraView [28].

4. Image Processing Algorithms

Another important requirement for image mining support is a set of image processing algorithms. These algorithms extract features from an image as a precursor to image mining. Image processing algorithms typically take one or more images as input and produce one or more images as output. The TerraLib libraries come with decoder translation utilities that convert to and from 52 different raster formats, including the popular ones. TerraLib includes basic operations for changing the size, orientation, scale, and other properties of an image. It provides a large set of algorithms for classification, mixture models, and geometric transformations.

Scaling or resampling enlarges or reduces a raster object by changing its geometry. TerraLib supports various resampling and interpolation methods, including interpolation by nearest neighbour, the average of the K nearest neighbour values, and the simple or weighted average of the elements in a box. It also supports sub-setting, by clipping image features using a polygon and by difference, intersection, union, and so on, of various image bands.

Filters are used for image enhancement, brightness and contrast adjustment, edge sharpening, smoothing, distortion correction and reducing the salt-and-pepper effect. Filters play a very important role in image data preparation. TerraLib implements convolution filters, linear filters, border detection filters, buffer-based filters, morphological filters, radar filters for contrast growing through image interpolation, the radar Kuan filter for reducing speckle noise in SAR images, the statistical Lee and Frost filters, and geometric ASF filters. Wavelet denoising filters are implemented to remove additive Gaussian noise by thresholding the wavelet coefficients.

Re-mapping is the process of changing an image's geometry to fit adjacent images. It uses background estimation, mask generation, parameter estimation and dynamic range re-mapping of an original image to produce an optimal output image. TerraLib dynamic range re-mapping includes an additive algorithm, a multiplicative algorithm, or a combination of both. Other state-of-the-art image re-mapping algorithms implemented by TerraLib are colour-based and PC-based.

Segmentation refers to the process of partitioning a digital image into multiple regions (sets of pixels). Segmentation is very important in image mining and is normally performed before classification. The purpose is to simplify the representation of an image into something that is more meaningful and easier to analyze. The result of image segmentation is a set of regions that collectively cover the entire image, or a set of contours extracted from the image (edge detection). The pixels in a region are similar with respect to some characteristic or computed property, for instance colour, intensity, or texture. State-of-the-art segmentation algorithms such as region growing, clustering and K-means classification, edge detection filters, the snake algorithm, and so on, are implemented by TerraLib for image segmentation.

Image fusion is the process of combining relevant information from two or more images into a single image. The resulting image is more informative than any of the input images. In remote sensing applications, the increasing availability of space-borne sensors motivates different image fusion algorithms, which allow the integration of different information sources. TerraLib implements IHS, wavelet and enhanced-wavelet algorithms for image fusion.

TerraLib allows the creation of mosaics, the composition of different data files into a single representation. TerraLib image transparency allows making an image theme active over a vector theme, so that the vector data can be seen under the image theme.
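As an illustration of the convolution filters listed above, a 3x3 mean (smoothing) filter applied to a small raster; border handling is simplified (the border is copied unchanged), and the code is a generic sketch rather than TerraLib's implementation:

```python
def mean_filter_3x3(img):
    """Smooth an image with a 3x3 mean convolution kernel (interior cells
    only; the one-cell border is copied through unchanged)."""
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = sum(img[i + di][j + dj]
                            for di in (-1, 0, 1)
                            for dj in (-1, 0, 1)) / 9.0
    return out

# A flat image with one "salt" pixel: smoothing spreads and dampens it,
# which is the mechanism behind salt-and-pepper noise reduction.
img = [[0.0] * 5 for _ in range(5)]
img[2][2] = 9.0
smoothed = mean_filter_3x3(img)
print(smoothed[2][2])   # 1.0 -- the spike is averaged down
print(smoothed[0][0])   # 0.0 -- untouched border
```

Swapping the averaging step for a weighted kernel gives the other linear filters (edge sharpening, border detection) mentioned in the text.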

5. Temporal Support

TerraLib supports two basic containers for spatial data: spatio-temporal objects and layers. A spatio-temporal object is an atomic feature whose identity is unique and persists over time. A layer is a collection of spatio-temporal objects that share the same geographical projection and the same set of attributes over a temporal period [39]. Information at a geographic location can be generated from different layers projected with the same geographic coordinates. Layers can be the bands of a hyperspectral image, thematic raster data such as soil maps, vector data, or a temporal dimension stored in a spatial database.

6. Iterators over Spatio-temporal Data Structures

TerraLib uses generic iterators over spatial data structures. These iterators allow sequential traversal of an entire image, or of the elements of a portion of an image delimited by a polygon. A request to the raster class is handled by a decoder, so that different image formats can be used seamlessly. Iterators provide an abstraction over image data types and bridge image processing algorithms with the data structures [31].

7. TerraLib Programming Interface

TerraLib provides C++, Java, COM, and Open Geo-Services (OGIS) environments for spatio-temporal application development. There are many different types of applications and user requirements, which require different components of TerraLib in different orders. A complete general-purpose interface incorporating all TerraLib components to satisfy all applications would potentially be very complex. An alternative approach is to construct simple application interfaces around specific workflows [38]. Using the TerraLib library, such applications and prototypes can be rapidly developed according to different requirements, for instance the ordering of operations, the conversion of files, connecting the input of one operation to the output of another, and so on.

8. TerraLib Visualization Interface

Geographic visualization is the use of visual geographical displays to explore data. Such exploration helps to generate hypotheses, to develop problem solutions, and to construct knowledge [40]. The TerraView visualization interface provided by INPE facilitates training on image data and creating hypotheses through the visualization of various spatio-temporal datasets at the same location. The results of image processing algorithms and of image mining can also be visually analysed and interpreted.

9. Flexibility, Extensibility, and Scalability

The TerraLib architecture is highly flexible. A front-end application works seamlessly with heterogeneous database data sets, as it can keep several connections to different databases open at the same time. A new algorithm can be added by registering it through the kernel, a new data format can be added by adding a decoder class, and a new database can be added by writing a driver, all without affecting any other code.


Chapter 3

Data mining methods

3.1 Introduction

Image mining is a multi-disciplinary technology that borrows tools and techniques from DBMS technology, statistics, machine learning, and image processing. In this chapter, state-of-the-art tools and techniques for data mining are explored. These tools and techniques are used in Chapter 5 of this thesis to develop various image mining application scenarios. Section 3.2 explores tools and techniques for classical data mining, which provide the basis for the overall mining process. The tools and techniques of classical data mining are extended with the introduction of spatial concepts for spatial data mining, and further with image processing concepts for image data mining. Section 3.3 discusses the spatial concepts added to classical data mining for spatial data mining. Section 3.4 discusses the image processing concepts added to spatial data mining that provide the basis for an image mining process.

3.2 Classical data mining

High-dimensional data in large databases leads to the so-called "information gap": as a result, we have a huge amount of data and little information. This requires techniques to handle the dimensionality of the data and to discover the information hidden in large, complex and high-dimensional data sources. Data mining is a set of techniques for reducing the information gap and for discovering patterns, associations, or relationships in the data.

A complete data mining process is a collection of various sub-processes. It starts with understanding the domain, defining the problems, and making assumptions about the data, and proceeds through data preparation, data transformation, mining for patterns, and evaluating the results to extract knowledge [77]. Data preparation involves data selection and data reduction. Data reduction is a process that removes noise, in the sense of irrelevance, from the data. The data transformation phase transforms a standard database into a form that is suitable for use by the mining algorithms. The actual mining, applying tools and techniques to discover patterns, is one sub-process in the whole data mining process. Evaluation is performed to extract knowledge from the patterns and to make suggestions for the problem for which the mining activity was carried out.

Data mining techniques are built using computer science and statistics.

Databases, artificial intelligence, and machine learning are important areas of computer science that play a role in developing techniques for data mining.

Inductive learning is a process of learning from the environment. According to Hand [41], two kinds of structure are sought during learning in a data mining activity: models and patterns. A model is a global summary of the relationships between variables that helps to understand a phenomenon. Here, tools and techniques are supervised and mostly driven by prior knowledge and simplifying theory. In contrast to the global description given by a model, a pattern is often defined as a characteristic structure exhibited by a small number of data points, for instance a small group of customers with a high risk. Here, tools and techniques are mostly self-organizing or unsupervised [42].

Data analysis techniques in data mining are mostly based on apriori or self-organizing learning:

• Apriori learning (supervised induction). This is learning from examples, where an operator helps the system to construct a model by defining classes and by providing examples of each class. The nature of such learning leads to predictive modelling, also called verification-driven data mining. Predictive models make predictions about values of the data using a set of known rules called apriori information, "truth data" or a "training set" [44]. Classification, regression, and time-series analysis are examples of predictive mining algorithms. These models may take a hypothesis from the user and test its validity against the data. The emphasis is on the user, who is responsible for formulating the hypothesis and for issuing the query on the data to confirm or negate it. Data classification has been widely studied in machine learning, statistics, neural networks, and expert systems.

• Self-organized learning (unsupervised induction). This is learning from observation and discovery. The nature of such learning leads to descriptive mining algorithms, which are exploratory in nature and identify patterns and relations in the data itself [44]. The user supplies data to the mining system without defining classes or providing any prior information. The mining algorithm observes the data, reduces its dimensionality, and recognizes patterns by itself. It then produces a set of class descriptions for each class discovered in the environment. Examples of descriptive algorithms include statistical and mathematical methods such as clustering and association rule mining. This approach to learning is called discovery-driven data mining: the emphasis is on the system, which automatically discovers important information hidden in the data [42]. Data clustering has been studied in statistics, machine learning, spatial databases, image processing, and the data mining area.

Two main tasks must be carried out in a data mining process: developing a data analysis technique or algorithm, and scaling the algorithm to large data sets in terms of computational cost [43]. Any technique developed for data mining depends upon the nature of the application domain and its associated problem, the nature of the data (such as type, volume, dimensions, and so on), computational scalability (such as set-up and execution cost), and the stakeholders. The stakeholders include the domain expert, the data miner, and any intended user of the data mining results.

We will discuss techniques for data mining borrowed from three major areas: statistics, databases and machine learning. The techniques of classical data mining are adjusted for spatial data mining and image data mining, respectively, according to domain-specific considerations.

3.2.1 Statistics

Statistical learning, through model construction and pattern recognition from the data, is one approach to machine intelligence. Once a statistical model has been recognized in the data, probability theory and decision theory can be applied to obtain an algorithm for estimation, description and prediction [21]. In descriptive learning, the statistical significance of a user hypothesis about the entire data set is determined. It is difficult to conduct a data mining activity without proper consideration of its fundamental statistical nature.

The statistical approaches adopted for data mining differ from standard classical statistics. In classical statistics, data is normally collected with a question in mind, and the statistical models are based on theory about the relationships among variables or on a description of the data. In data mining, by contrast, one may simply carry out a stepwise search over a set of potentially explanatory variables to obtain a model or pattern. The model or pattern obtained should possess good predictive power to confirm the assumptions about the data. In statistical inference, a statement about the population is made after observing a sample, whereas data mining algorithms are executed over the entire population of data. The nature of statistical analysis in classical statistics is confirmatory, fitting the model as well as possible to the observed data; precise sample data collection, with the method or model in mind, is therefore primary. Statistical analysis in data mining, on the other hand, is exploratory, aiming to discover an unexpected model or pattern in the data, and precise data collection is secondary. Data in data mining is voluminous, and many classical statistical tools may fail in such circumstances; for instance, a scatter plot of one million points can easily display as a solid black mass [41].

The role of statistical techniques in data mining starts at the data preparation phase. Data cleaning refers to the identification of anomalies in the data that need to be removed or separately addressed in an analysis. Outlier analysis searches for data items that deviate unexpectedly from some norm. Variable-by-variable data cleaning is a straightforward process, but anomalies often only appear when many attributes of a data point are considered simultaneously [45]. Detecting and removing outliers is important in data mining, both for cleaning the data and for investigating unusual events. Table 3.1 shows statistical methods that can be used in various stages of a data mining activity.

Segmentation involves partitioning selected data into meaningful groups.

The segmentation process is either classification or clustering.

Classification is normally supervised and based upon predefined rules derived from training data. A parametric classifier assumes that the data follow some known parameterized probability density function (PDF). A nonparametric classifier is typically used when there is insufficient knowledge about the type of underlying PDF for the domain. Self-organizing classifier models, such as certain kinds of neural networks, are also considered nonparametric classifiers when they make no a priori assumptions about the PDF.
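The parametric case can be sketched in a few lines, assuming a one-dimensional feature and normal class-conditional PDFs; the class names ("water", "cloud") and sample values below are invented for illustration:

```python
import math

# Invented training samples of a single reflectance-like feature per class.
train = {"water": [0.1, 0.15, 0.2, 0.12], "cloud": [0.8, 0.85, 0.9, 0.75]}

def fit(samples):
    """Estimate the parameters (mean, variance) of a normal PDF."""
    m = sum(samples) / len(samples)
    var = sum((x - m) ** 2 for x in samples) / (len(samples) - 1)
    return m, var

def likelihood(x, params):
    """Normal probability density at x for the fitted parameters."""
    m, var = params
    return math.exp(-(x - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

models = {cls: fit(xs) for cls, xs in train.items()}

def classify(x):
    """Assign x to the class whose fitted PDF gives it the highest density."""
    return max(models, key=lambda cls: likelihood(x, models[cls]))

print(classify(0.18), classify(0.7))
```

A nonparametric classifier would skip the `fit` step and estimate the densities directly from the data, for instance with histograms or kernel density estimates.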

Clustering is an unsupervised process in which various statistical techniques are employed to partition the data into groups. Cluster analysis methods for data mining must accommodate large data volumes and high-dimensional data; this usually requires statistical approximations or heuristics [46]. Many machine learning techniques have also been adapted for statistical learning, in which decisions are made using a statistical criterion over attribute values. Table 3.1 lists various statistical techniques available for segmentation.
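As a minimal sketch of such a partitioning process, the k-means algorithm from Table 3.1 can be run on invented two-dimensional points (a real application would cluster pixel feature vectors); the points are ordered so that a deliberately naive initialisation from the first k points is deterministic:

```python
# Invented, well-separated 2-D points; the first two fall in different groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.3, 7.9), (7.8, 8.2)]

def kmeans(points, k, iters=10):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its cluster."""
    centroids = points[:k]   # naive init; practice prefers random restarts
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                  + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        centroids = [(sum(p[0] for p in cl) / len(cl),
                      sum(p[1] for p in cl) / len(cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans(points, 2)
print([len(c) for c in clusters])   # [3, 3]
```

Scaled-up variants replace the exhaustive nearest-centroid search with spatial index structures, as discussed for k-medoids in Section 3.3.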

Principal component analysis (PCA) is a statistical clustering technique based on statistical correlation, as mentioned in Table 3.1; it was used in building the cloud detection application scenarios in Chapter 5. PCA is a nonparametric method to extract information from high-dimensional data sets. The eigenvectors of the covariance matrix computed from an image represent the principal components of that image, and they are orthonormal to each other. PCA extracts relevant features from data by performing an orthogonal transformation, in this way reducing a complex data set to a lower dimension [85]. In our cloud detection application scenarios for image mining, developed in Chapter 5, the feature space is a space of clouds.
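A minimal sketch of this transformation, with invented pixel values standing in for two image bands (a real scenario would read actual Meteosat bands): the covariance matrix of the two centred bands is decomposed in closed form, and the pixels are projected onto the dominant eigenvector.

```python
import math

# Hypothetical two-band "image" flattened to per-pixel values (invented).
band1 = [10.0, 12.0, 11.0, 14.0, 13.0, 15.0, 16.0, 18.0]
band2 = [20.0, 23.0, 21.0, 27.0, 25.0, 29.0, 31.0, 35.0]
n = len(band1)

m1, m2 = sum(band1) / n, sum(band2) / n
c1 = [x - m1 for x in band1]          # centre each band
c2 = [x - m2 for x in band2]

# Sample covariance matrix [[s11, s12], [s12, s22]]
s11 = sum(a * a for a in c1) / (n - 1)
s22 = sum(b * b for b in c2) / (n - 1)
s12 = sum(a * b for a, b in zip(c1, c2)) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix, in closed form
tr, det = s11 + s22, s11 * s22 - s12 * s12
disc = math.sqrt(tr * tr / 4 - det)
lam1, lam2 = tr / 2 + disc, tr / 2 - disc   # lam1 >= lam2

# Unit eigenvector of the dominant eigenvalue: the first principal component
v = (s12, lam1 - s11)
norm = math.hypot(*v)
pc1 = (v[0] / norm, v[1] / norm)

# Project pixels onto PC1: most of the variance collapses to one dimension
scores = [a * pc1[0] + b * pc1[1] for a, b in zip(c1, c2)]
print(round(lam1 / (lam1 + lam2), 3))   # fraction of variance kept by PC1
```

For correlated bands the first component captures nearly all of the variance, which is why a handful of principal components can summarize a multi-band image.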

3.2.2 Database-oriented approaches to data mining

Database-oriented methods do not search for a model, as machine learning or statistical methods do. Instead, data modelling and other database-specific heuristics are used to exploit the characteristics of the data at hand. Implicit associations in the data can reveal hidden patterns when they are made explicit through database modelling and design.

A mining activity requires data preparation and data transformation phases to transform the database into an algorithm-compatible format. Indeed, it would not be feasible to reorganize the data in a database for each new mining algorithm. It is more practical to provide minimal database design arrangements that fulfil the format compatibility criteria of many mining algorithms. This is one reason why data warehouses are designed with many data mining and other knowledge discovery tools and techniques in view, and vice versa.

There are two approaches to providing a mining algorithm for large databases. A mining algorithm can be external to the DBMS, in the form of libraries, or integrated with the DBMS, in the form of stored procedures. In the latter approach, the data does not leave the database during the mining process. This integrated approach has been adopted by Oracle for data mining, and its advantage is declarativeness. The former approach is adopted by TerraLib, and its advantage is flexibility: a data miner has more control over data inputs and outputs and can rapidly develop prototypes on various DBMS platforms using different programming interfaces.

Table 3.1: Statistical methods for data mining.

Data preparation — outlier detection:
- Clustering approaches: partition the data into a number of clusters, where each data point can be assigned a degree of membership to each of the clusters [47].
- Distribution-based: based on deviation from some probability distribution, such as the normal (Gaussian) or Poisson distribution.
- Distance-based: an object O in a dataset T is an outlier DB(p, D) if at least a fraction p of the objects in T lies at a distance greater than D from O [47].
- Density-based: assigns each object a degree of being an outlier, called the local outlier factor (LOF), which depends on how isolated the object is with respect to its surrounding neighbourhood [48].

Segmentation — classification:
- Decision trees: a predictive classification model borrowed from machine learning that represents a set of decisions generating rules for the classification of a dataset. A decision, or test on attribute values, is prioritized through some statistical criterion such as entropy, information gain, the Gini index, the chi-square test, measurement error, or classification rate. Discussed further under machine learning.
- Linear regression: a predictive model for a continuous target variable. Its disadvantage is that the assumed normal distribution of the response variable is sometimes violated.
- Bayesian inference: statistical inference in which evidence or observations are used to update, or to newly infer, the probability that a hypothesis is true. The name comes from the frequent use of Bayes' theorem in the inference process. The most widely used methods are naive Bayesian classification, which assumes that all attributes are independent, and Bayesian belief networks, which assume that dependencies exist among subsets of attributes. Representing dependencies among random variables by a graph, in which each random variable is a node and the edges represent conditional dependencies, is the essence of the graphical models that play an increasingly important role in machine learning and data mining [49].

Segmentation — clustering:
- Statistical correlation: finds associations between fields in the data. The most common methods are principal component analysis and the Mahalanobis metric.
- K-means: a distance-based clustering algorithm that partitions the data into a predetermined number of clusters (provided there are enough distinct cases). It works only with numerical attributes. Distance-based algorithms rely on a distance metric to measure the similarity between data points.

Dependency analysis (finding rules to predict the values of some attribute based on the values of other attributes):
- Association rule mining: associations between independent instances of datasets that form rules; mostly used as a database transaction-based approach for revealing interesting patterns in business transactional data. Discussed further in the database section.
- Bayesian belief networks: discussed under Bayesian inference above.

Data mining algorithms deal with large numbers of variables and dimensions. Especially for very large databases, a mining algorithm must be scaled for computational efficiency and performance. An algorithm is said to be scalable if, for a given amount of main memory, its runtime increases linearly with the size of the input [60]. Existing data mining activities are much driven by practical computational concerns.

In many cases, data mining algorithms for large databases do not scale linearly in any simple way. Scalability of the mining algorithms then becomes one of the major activities under database management. The database is designed in a way that reduces the execution steps of a mining algorithm, and it is tuned to make effective use of DBMS-provided performance support such as indexes, optimizations, and a hierarchical storage structure. Further external arrangements to raise performance can also be provided, for instance parallel processing and increased computational power through grid computing.

We will discuss some techniques that have been extensively applied in databases during the various stages of data mining.

Data preparation, reduction and cleaning

SQL and user-defined functions offer a great degree of flexibility and efficiency in extracting information from very large databases holding heterogeneous datasets [51]. Information is extracted using these functions in the selection, projection, and joining of database records; useful SQL capabilities include mathematical and analytical functions [52]. In an ideal situation, a data mining algorithm would be designed to access the database directly using query tools, but normally a time-consuming procedure is involved in transforming the database into an algorithm-compatible format [21].

During the data preparation phase, datasets are reduced or arranged at different levels of information according to their dimensions. Selection is performed to determine a subset of records from the database, to clean noise and duplication, and to handle missing values. A database join is performed to integrate datasets and to provide flexibility for data manipulation, and data summaries are generated with aggregate and analytic functions. Traditional SQL queries can help in the data preparation phase; however, they are inadequate for exploring patterns.

Multilevel data generalization, summarization, and characterization

Data generalization is a process that abstracts a large set of relevant low-level data in the database to higher-level concepts. The task of characterization is to find a compact description for a selected subset of data in the database [53]. Data generalization is important because most databases have been designed for on-line transaction processing (OLTP), where the aim is to serve small transactions. A mining process, however, involves scans over large search spaces, which require better access structures, optimized disk I/O, and other performance options.

A first approach to data generalization is to generate summary rules, normally implemented in the form of a data warehouse (DW). The general idea is to materialize certain expensive computations that are frequently requested, especially those involving aggregate functions for knowledge discovery. The DW integrates different types of data and reduces the dimensionality of the data inside the databases, thereby reducing runtime computational costs when scaling a mining algorithm.

This database-oriented technique for data mining was used to build a scenario for our cloud pattern detection image mining application. Summary statistics for the principal components were generated and stored in a specially designed schema, and vector data summaries were also calculated and stored. This summary data can be used to construct spatio-temporal queries while minimising run-time computational costs.

The most popular data mining and analysis tools associated with data warehouses carry several alternative names, for instance on-line analytical processing (OLAP), multi-dimensional databases, multi-scale databases, data cubes, and materialized views [54]. A well-designed DW supports all such knowledge discovery tools and techniques.

Structured OLAP tools include roll-up (increasing the level of aggregation), drill-down (decreasing the level of aggregation), slice and dice (selection and projection), and pivot (re-orientation of a multidimensional data view) [55]. OLAP tools provide the most powerful query techniques that can be effectively employed in a knowledge discovery process, but they are not able to discover patterns by themselves [56]. A powerful and commonly applied OLAP tool for higher-dimensional data aggregation is the data cube, an N-dimensional generalization of the more commonly known SQL aggregation functions and GROUP BY clause [57]. The schema for image and vector data summaries that was populated for the cloud pattern detection image mining application in Section 5.4.2 can also serve such OLAP tools.
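The data cube idea can be illustrated in a few lines: aggregate a measure over every subset of the dimensions, so that roll-up and slice queries reduce to look-ups. The fact table below is invented, loosely echoing the cloud detection setting:

```python
from itertools import combinations

# Toy fact table: (region, year, cloud_type) -> pixel_count (invented).
rows = [
    ("north", 2008, "cirrus", 120),
    ("north", 2008, "cumulus", 80),
    ("north", 2009, "cirrus", 150),
    ("south", 2008, "cumulus", 200),
    ("south", 2009, "cirrus", 90),
]
dims = ("region", "year", "cloud_type")

def cube(rows, dims):
    """Aggregate the measure over every subset of dimensions (a data cube).

    Keys use 'ALL' where a dimension is rolled up, mimicking SQL's
    GROUP BY ... WITH CUBE."""
    result = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            for *key_vals, measure in rows:
                key = tuple(key_vals[i] if i in group else "ALL"
                            for i in range(len(dims)))
                result[key] = result.get(key, 0) + measure
    return result

c = cube(rows, dims)
print(c[("ALL", "ALL", "ALL")])    # grand total: 640
print(c[("north", "ALL", "ALL")])  # roll-up over year and cloud type: 350
print(c[("ALL", 2008, "cirrus")])  # slice: 2008 cirrus across regions: 120
```

Materializing such cells in advance is exactly the "expensive computations materialized" strategy of the data warehouse; a real DW materializes only the most frequently requested cells.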

The second approach is attribute-oriented induction which, in contrast with the DW, is an on-line, generalization-based data analysis technique. It is based on learning-by-example algorithms from AI and machine learning, integrated with database operations (e.g., GROUP BY) [58]. The main goal is to generalize low-level data into high-level concepts using conceptual hierarchies derived from the data itself in a self-organizing way, rather than designed and stored explicitly by a domain expert.

Association rule discovery

Another kind of pattern that can be extracted from a database is the association rule between independent instances of the database.

An iterative database method is employed to search for frequent item sets in a transactional database; the association rules are then derived from those discovered frequent item sets [50].

Association rule discovery can be formally defined as follows: “An association rule is an expression X ⇒ Y (c%, r%), where X and Y are disjoint sets of items from the database; c% is the confidence, the proportion of the database transactions containing X that also contain Y, in other words the conditional probability P(Y|X); and r% is the support, the proportion of the database transactions that contain both X and Y, i.e., P(X ∪ Y)” [59]. This database-oriented technique for data mining can also be applied to the attribute data produced for the vector and image data in Section 5.4.2 of this thesis, for our cloud pattern detection image mining application.

A high computational cost is involved, as each database item is visited in the rule discovery process. Existing association rule mining algorithms are Apriori-like approaches.
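A minimal Apriori-style sketch over invented transactions (the item names merely suggest spatio-temporal attributes) illustrates the support and confidence measures defined above:

```python
from itertools import combinations

# Invented transactions standing in for spatio-temporal attribute data.
transactions = [
    {"cloud", "rain", "low_temp"},
    {"cloud", "rain"},
    {"cloud", "low_temp"},
    {"rain", "low_temp"},
    {"cloud", "rain", "low_temp"},
]
min_support = 0.6   # an itemset must occur in at least 60% of transactions

def support(itemset):
    """Proportion of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Apriori: frequent k-itemsets are generated only from frequent (k-1)-itemsets.
frequent = {}
level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
while level:
    level = [s for s in level if support(s) >= min_support]
    frequent.update({s: support(s) for s in level})
    # join step: unite itemsets that differ in exactly one item
    level = list({a | b for a, b in combinations(level, 2)
                  if len(a | b) == len(a) + 1})

# A rule X => Y: confidence is P(Y|X) = support(X u Y) / support(X)
x, y = frozenset(["cloud"]), frozenset(["rain"])
conf = support(x | y) / support(x)
print(support(x | y), conf)
```

The pruning in the join step is what makes Apriori tractable: any superset of an infrequent itemset is itself infrequent, so whole branches of the search space are never visited.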

3.2.3 Machine learning approaches for data mining

Machine learning (ML) is a broad subfield of artificial intelligence (AI) concerned with the development of algorithms and techniques for building computer systems that can “learn” from data. ML and data mining share the same broad goal of finding novel and useful knowledge in data, and therefore have techniques and processes in common. The fundamental difference between ML and data mining lies in the volume of the data being processed [60]. Like statistical methods, machine learning methods search for the model that best matches the data. Unlike statistical methods, the search space is a cognitive space of n attributes instead of a vector space of n dimensions [61]. Machine learning algorithms are modified when they are adopted for data mining, according to the range of data types, tasks, and methods involved.

We will discuss some machine learning techniques that have been used extensively in data mining:

1. Decision trees. A decision tree is a flowchart-like structure consisting of internal nodes, leaf nodes, and branches. Each internal node represents a decision or a test on a data attribute, each outgoing branch corresponds to a possible outcome of the test, and each leaf node represents a class [62]. Decision trees are extensively used in mining applications when a decision for a class is based on many different data sources. Five common tree algorithms are CART, CHAID, ID3, C4.5, and C5.0. Decision trees are used in data mining for pattern discovery in highly multivariate data; this applies to categorical output, when the target variable is discrete.

2. Neural networks are information-processing devices that consist of a large number of simple nonlinear processing modules called neurons, connected by elements that have information storage and programming functions. Decision boundaries are calculated from rules that are built during the training process; a decision rule amounts to finding the proper values of the weights associated with the interconnections of the neurons. There is no fixed decision rule: it is evaluated iteratively by minimizing some error criterion on the labelling of the training data. Neural nets have been widely used as a classification technique in supervised data mining.

3. Rule induction produces a set of if-then-else rules from a database, unlike decision tree methods, which employ a divide-and-conquer strategy. Some popular methods are CN2, IREP, RIPPER, and LUPC.

4. Hidden Markov models are statistical models in which the system being modelled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from observations. Recent finite-state-machine methods, including maximum entropy Markov models (MEMMs) and conditional random fields (CRFs), have shown high performance in various structured prediction problems [60].

5. Kernel methods. A kernel is a function that transforms the input data to a high-dimensional space in which a problem is solved; kernel functions can be linear or nonlinear. Support vector machines (SVMs) have been widely used as kernel methods in data mining for both structured and non-structured data such as text and image data.
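The statistical criteria mentioned for decision trees (item 1) can be made concrete with a small entropy and information gain computation; the attributes, values, and class labels below are invented:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Reduction in label entropy obtained by splitting on one attribute."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(r[attr_index] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr_index] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Invented training data: (texture, brightness) -> cloud / clear
rows   = [("smooth", "high"), ("smooth", "low"), ("rough", "high"), ("rough", "low")]
labels = ["cloud", "cloud", "clear", "clear"]

# Texture separates the classes perfectly; brightness carries no information.
print(info_gain(rows, 0, labels), info_gain(rows, 1, labels))
```

An ID3-style tree builder simply picks, at every node, the attribute with the highest information gain and recurses on the resulting subsets.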

3.3 Spatial data mining

Spatial data mining extends data mining techniques with spatial concepts. The primitives of spatial data mining are based on the concept of neighbourhood spatial relations: spatial operators such as topological, distance, and direction operators are combined with logical operators to express more complex neighbourhood relations. Such neighbourhood considerations make the data mining process more complex [53].

The differences between classical and spatial data mining are:

1. Classical data mining deals with numbers and categories, whereas spatial data is more complex and includes extended objects such as points, lines, and polygons.

2. Classical data mining works with explicit inputs, whereas spatial predicates are often implicit, in terms of topological relations such as overlap and within.

3. Data samples for statistical procedures in classical data mining are highly dimensional and independent; the data in spatial data mining is highly dimensional as well as auto-correlated.

4. Spatial data varies across several scales of resolution. Spatial dependencies (geographic dependencies in which some objects are always related to other objects) on a small scale turn into random variation when analysed using broader units of measurement. Irwin [13] performed spatial analysis at several spatial scales to examine substantive scale dependencies in the underlying processes that influence urban land use patterns, and observed that the estimated parameters of the regression model vary significantly across different spatial scales of analysis.

Making the implicit relations between geographic objects explicit is vital in any spatial knowledge discovery process. Topological relations, distance, and dimension are implicit in geographical data; they are made explicit by identifying objects, defining ontologies, and measuring spatial autocorrelation. “Ontology is a content theory which contains a general set of facts to be shared and whose main contribution is to identify specific classes of objects and relations that exist in some domain” [63]. Spatial autocorrelation is a measure of the dependencies among neighbouring objects in space.

Many analogous terms are used for the identification of objects: filtering, feature extraction, finding dependencies, or using the spatial dimension in the discovery process. Feature extraction is performed either prior to or during the mining stage. Identified objects and the relationships between them are represented as attributes and as one-to-one or one-to-many relationships in the conceptual schema.

3.3.1 Spatial statistics

The difference between classical and spatial statistics is similar to the difference between classical and spatial data mining: data samples in spatial statistics are not independent but highly correlated. Spatial autocorrelation, a measure of the dependency between spatial data variables, is the area of statistics devoted to the analysis of such data. The spatial distribution of the values or classes of a certain attribute sometimes shows distinct local trends that contradict the global trend. Measures of dispersion such as range, standard deviation, and variance include a weight for proximity when calculated over geographic distributions [21]. Various spatial statistical methods for spatial analysis in spatial data mining are listed in Table 3.2.
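The notion of spatial autocorrelation can be made concrete with Moran's I, a widely used measure (also behind the Moran scatterplots of Table 3.2). The sketch below computes it for invented values along a one-dimensional transect with binary neighbourhood weights:

```python
def morans_i(values, neighbours):
    """Moran's I for a list of values and a symmetric adjacency list.

    neighbours[i] lists the indices adjacent to observation i (binary
    weights); I > 0 indicates positive spatial autocorrelation."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(len(nb) for nb in neighbours)          # total weight
    num = sum(dev[i] * dev[j]
              for i, nb in enumerate(neighbours) for j in nb)
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# Four observations along a transect; invented values that rise smoothly,
# so nearby observations are similar and positive autocorrelation is expected.
values = [1.0, 2.0, 8.0, 9.0]
neighbours = [[1], [0, 2], [1, 3], [2]]   # chain adjacency
print(morans_i(values, neighbours))       # positive, approximately 0.4
```

Shuffling the values so that neighbours are dissimilar would drive I towards negative values, which is the signal a spatial analyst looks for when choosing between classical and spatial models.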

Spatial data quality, in terms of accuracy, also contributes significantly to spatial statistical results. Environmental interference, acquisition device limitations, the transmission process, and so on induce errors in spatial data; the fuzziness of geographic phenomena and vague object definitions further affect spatial data quality [64]. Another influencing factor, especially in spatial statistical aggregation, is the choice of resolution, which gives rise to the modifiable areal unit problem, comprising scale and zoning effects. The scale effect may lead to different statistical results if information is grouped at different levels of spatial resolution; the zoning effect refers to the variability of statistical results if the borders of spatial units are chosen differently at a given scale of resolution. Both effects need to be handled carefully when aggregating spatial data in a spatial statistical analysis [65].

Table 3.2: Statistical methods for spatial data mining.

Data preparation — spatial outlier detection (a spatial outlier is a spatially referenced object whose non-spatial attribute values differ significantly from those of the other spatially referenced objects in its spatial neighbourhood [67]):
- Graphical tests: based on visualization of spatial data, such as variogram clouds and Moran scatterplots [66].
- Quantitative methods: provide a precise test to distinguish spatial outliers from the remainder of the data.
- Factor-based methods: an efficient local outlier discovery method that assigns each object a spatial outlier factor (SOF), the degree to which the object deviates from its neighbours [67].

Data preparation — interpolation and estimation: statistical tools for modelling spatial variability and for interpolation (prediction) of attributes at unsampled locations.

Segmentation — classification: classification in spatial statistics considers the spatial correlation between objects when modelling the dependencies between them. The simplest way to model spatial dependencies is through spatial covariance. Two methods that have been widely employed to model spatial dependencies in classification/prediction problems are:
- Spatial autoregression model (SAR): linear regression models assume that variables are independent, whereas spatial autoregression models incorporate spatial autocorrelation and neighbourhood relationships into the classical linear regression model. If the spatial autocorrelation coefficient is statistically significant, SAR also quantifies the spatial autocorrelation [66].
- Markov random fields (MRF): MRF-based Bayesian classifiers estimate a model using MRFs and Bayes' rule. A set of random variables whose interdependency relationship is represented by an undirected graph (i.e., a symmetric neighbourhood matrix) is called a Markov random field. The Markov property specifies that a variable depends only on its neighbours and is independent of all other variables [21].

Segmentation — spatial clustering: spatial clustering is based on the fact that objects in space group together and exhibit patterns; however, the statistical significance of spatial clusters should be measured by testing the assumptions in the data. One such measure is based on quadrats, well-defined areas, often rectangular in shape: statistics are derived from counts calculated from the location and orientation of quadrats. After verification of the statistical significance of spatial clustering, classical clustering algorithms can be used to discover interesting clusters [66]. Spatial clustering methods incorporate both the spatial and the non-spatial attributes of the objects.
- Hierarchical clustering methods: start with all patterns as a single cluster and successively perform splitting and merging until a stopping criterion is met.
- Partitional clustering methods: data points are allocated iteratively to find clusters of spherical shape; squared error is the most frequently used criterion function. K-means and k-medoids are commonly used partitional algorithms. A k-medoids algorithm can also be scaled by taking advantage of a spatial index structure such as the R-tree for clustering a large spatial database. Some recent algorithms in this category are partitioning around medoids (PAM), clustering large applications (CLARA), clustering large applications based on random search (CLARANS), and expectation-maximization (EM) [21].
- Density-based: clustering algorithms that find clusters based on the density of data points in a region. Examples are density-based spatial clustering of applications with noise (DBSCAN) and density-based clustering (DENCLUE) [21].
- Grid-based: first quantize the clustering space into a finite number of cells and then perform the required operations on the quantized space. Examples are the statistical information grid-based method (STING), WaveCluster, BANG-clustering, and clustering in quest (CLIQUE) [21].

Dependency analysis:
- Spatial co-location: the co-location pattern discovery process finds frequently co-located subsets of spatial event types given a map of their locations. It measures spatial correlation to characterize the relationship between different types of spatial features; measures of spatial correlation include the cross K-function with Monte Carlo simulation, mean nearest-neighbour distance, and spatial regression models [66].
- Spatial association rule mining: discussed under the spatial database-oriented approach.

3.3.2 Spatial database approach to data mining

A spatial object is described by its geometric properties, such as shape, location, and area, and by the topology in which it is embedded: the relationships between it and other objects. Spatial objects are stored as geometric attributes along with non-spatial attributes in a spatially extended database. Topological relationships are not stored explicitly but are calculated at run time. Spatial objects can be queried and analysed along with other non-spatial attributes through SQL in a declarative way.

The SQL of a spatial database supports topological operators such as within, touches, and intersects, and spatial functions for measurement, management, transformation, and so on. Spatial joins are based on the topological relationships. SQL operators and functions also take advantage of the performance and fast-retrieval measures of the spatially extended database, such as spatial indexing, a hierarchical storage structure, and optimization. Spatial data can also be pre-processed and materialized within the spatial database. This makes spatial databases the most suitable platform for spatial data mining, as compared to databases for classical data mining.
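As a sketch of how such a topological predicate can be evaluated at run time, the following implements a simple ray-casting "within" test for a point and a polygon (coordinates invented; spatial databases use more elaborate, index-supported algorithms):

```python
def within(point, polygon):
    """Ray-casting test: is the point inside the polygon?

    Illustrates a topological predicate of the kind spatial SQL exposes as
    an operator; polygon is a list of (x, y) vertices in order."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does edge (x1,y1)-(x2,y2) cross the horizontal ray from the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside   # an odd crossing count means inside
    return inside

square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(within((2.0, 2.0), square), within((5.0, 1.0), square))   # True False
```

In practice, the spatial index first prunes candidates by bounding box, and an exact test such as this one is run only on the survivors.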

Spatial data aggregation and spatial association rule discovery are the mostexciting spatial database-oriented techniques under spatial data mining.

Spatial data model

A spatially extended database is a repository of integrated spatial and non-spatial data. Implicit geographic objects are made explicit through conceptual and logical data models that identify and define objects and the relationships among them. Modelling of spatial data involves modelling the shape and location of objects, modelling the dependencies and topological relationships among spatial objects, and modelling continuous geographic data such as elevation. Unlike in statistical methods, the search space is a cognitive space of N spatial objects with spatial and non-spatial attributes instead of a vector space of N dimensions; for continuous geographical phenomena, however, statistical methods are used extensively to quantify dependencies in terms of autocorrelation.

Modelling space, feature identification, and feature extraction are analogous terminologies used in the literature. A spatial feature has the spatial and non-spatial characteristics of a geographic object, and a layer represents a distinct set of geographic features in a particular application domain. Both aim to extract implicitly encoded information on spatial relationships. Various models have been proposed and implemented in spatial databases; we will discuss the two modelling techniques most widely adopted by the spatial database community.

The first approach is the object model, which has OGC consensus and is the most widely used; it has been implemented in PostGIS and Oracle. Things in the object model are treated as objects. Spatial geometry is realized through spatial data types such as point, line, and polygon, and spatial relationships are described by topological, directional, and metric units.

The second approach is the neighbourhood paths and graphs proposed by Ester [53]. A neighbourhood relationship is represented by a graph with N nodes and E edges: each object is a node, and two nodes are connected by an edge. An edge represents length and direction; a node represents topological relationships. This is a highly effective method for modelling dependencies, but it calculates and stores some unnecessary relationships, resulting in high computational costs.

A graph-based approach is eminently suitable for explicitly storing spatial object relationships in non-structured data such as images, to support similarity searches for object retrieval from an image database. Different graph models, such as attributed relational graphs (ARGs) and region adjacency graphs (RAGs), are employed to aggregate image regions.
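A minimal sketch of such a neighbourhood graph follows. The object names and coordinates are invented, and a simple distance threshold stands in for whatever neighbourhood relation (adjacency, topology) a real model would use:

```python
import math

# Invented spatial objects reduced to representative points.
objects = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 5.0), "D": (1.0, 1.0)}

def neighbourhood_graph(objects, max_dist):
    """Build an edge set connecting objects closer than max_dist.

    Each edge stores the distance, mimicking a graph whose edges carry
    length; nodes are the objects themselves."""
    edges = {}
    names = sorted(objects)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = math.dist(objects[a], objects[b])
            if d <= max_dist:
                edges[(a, b)] = d
    return edges

g = neighbourhood_graph(objects, 2.0)
print(sorted(g))   # C is isolated: too far from A, B, and D
```

A region adjacency graph for an image is built the same way, except that the "near" test is replaced by a shared-boundary test between segmented regions.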

Multilevel spatial data generalization and aggregation

We have already seen for classical data mining that a data warehouse integrates different types of data, reduces the dimensionality of the data, and reduces runtime computational costs; it therefore helps in scaling the mining algorithms that find patterns in the database.

Elzbieta [68] defines terminology for a spatial data warehouse (SDW). A spatial level is a level for which the application needs to store spatial characteristics, such as country, state, and city. A spatial attribute is an attribute with a geometric type, such as point, line, or polygon. A hierarchy defines the navigation path for roll-up and drill-down along a dimension, and all attributes in a hierarchy belong to the same dimension. A dimension is a single category of information; for instance, year, month, week, and day form a hierarchy in the time dimension. A spatial hierarchy is a hierarchy that includes at least one spatial level, and a spatial dimension is a dimension that includes at least one spatial hierarchy. Non-spatial dimensions, hierarchies, and levels are usually called thematic. Related spatial levels in a spatial hierarchy exhibit a topological relationship. A spatial fact relationship is a relationship that requires a spatial join between two spatial levels; the spatial join is based on topological relationships. These definitions give a brief overview of how an SDW designer thinks while designing a spatial data warehouse.

Generalization involves the construction of hierarchies of spatial objects based on spatial and non-spatial attributes. A spatial hierarchy represents a successive merging of neighbourhood regions into larger regions. In data warehousing and OLAP, these spatial hierarchies allow both a detailed view and a general view of data using roll-up and drill-down operations.

Aggregation is the process of grouping multiple individual objects based on some criteria to form a new composite object. Aggregation is an important operation given its ability to reduce spatial complexity while at the same time retaining most thematic information of the component objects. Aggregation can be based upon similarity of the objects involved or on the functional relationships between them.

Similarity of the objects is assessed through comparison based upon the scale of measurement (i.e., nominal, ordinal, interval, and ratio), the geometric spatial reference such as a co-ordinate system, and the semantic spatial reference such as a place name. Classification (categorizing things into classes) based on such similarity at different levels leads to a classification hierarchy or taxonomy. The functional relationships lead to the formalization of geo-ontologies or partonomies, which can be either user-defined or derived from statistical properties in the dataset itself through the quantification of spatial autocorrelation. Defining ontologies on the abstraction of spatial entities is an active area of research [69].

Spatial data warehousing and spatial data mining are two complementary techniques for knowledge discovery. The first mainly focuses on the development of multidimensional spatial data models to support spatial data aggregation and spatial navigation. In contrast, spatial data mining deals with the development of algorithms for the discovery of complex spatial knowledge such as spatial clustering and spatial association rules. Currently, one of the most exciting research areas is the integration of multidimensional data modelling, OLAP, and data mining in a comprehensive system. Such integration will address different aspects of data modelling, user interaction, and complex knowledge extraction [70].

In our cloud patterns detection application in Chapter 5, summary data for all vector polygons in the study area was calculated through the spatial aggregate functions. These summaries were then materialized in the spatial database as a spatial data warehouse. For image data mining, the principal component analysis algorithm was applied to the time-series images of the study area, and statistics over the resultant PCs were materialized in the database. The vector summary data, such as the total area of the study area, was joined with the time-series image statistics to construct a spatio-temporal query for pattern analysis. This integration of vector data, or any other non-spatial data, with remote sensing image analysis results can be extremely useful for constructing knowledge in an image mining process. It also presents a method for integrating spatial data warehousing and image mining, combining various sources of information at different levels. This approach can be effectively applied to spatio-temporal pattern analysis, for instance for a specific phenomenon or a land cover type at various geographical levels.
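The join between materialized vector summaries and time-series image statistics can be sketched in C++ (the implementation language of this work); the structures and function names here (`RegionSummary`, `PcStat`, `joinSummaries`) are illustrative stand-ins, not TerraLib or PostGIS types:

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative structures: a materialized vector summary (e.g., total
// polygon area per region) and per-epoch statistics over a principal
// component of the image time-series.
struct RegionSummary { double totalArea; };
struct PcStat { std::string regionId; int epoch; double pc1Mean; };

// One row of the spatio-temporal result: epoch, PC statistic, and the
// joined vector summary value.
struct JoinedRow { int epoch; double pc1Mean; double totalArea; };

// Join the image statistics with the vector summaries on region id,
// mirroring the spatio-temporal query described in the text.
std::vector<JoinedRow> joinSummaries(
    const std::map<std::string, RegionSummary>& vectorSummaries,
    const std::vector<PcStat>& imageStats) {
  std::vector<JoinedRow> out;
  for (const PcStat& s : imageStats) {
    auto it = vectorSummaries.find(s.regionId);
    if (it != vectorSummaries.end())
      out.push_back({s.epoch, s.pc1Mean, it->second.totalArea});
  }
  return out;
}
```

In the actual application this join is expressed as a database query over materialized tables; the sketch only shows the shape of the combined records.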

Spatial association rule discovery

The concept of an association rule from classical data mining is extended for spatial association rule discovery: the association between objects is defined by spatial neighbourhood relations, in terms of spatial predicates rather than items. A spatial association rule is an implication of the form X => Y, where X and Y are sets of predicates and at least one element in X or Y is a spatial predicate [71].

At least three steps are involved in spatial association rule discovery [65]:

• Compute spatial relationships between objects as a data pre-processing step, to lower the otherwise high computational costs.

• Find frequent sets of predicates. A frequent predicate is one whose support is at least equal to the minimum support.


• Generate strong association rules. A rule is strong if it reaches the minimum support and its confidence is at least equal to a confidence threshold.
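The support and confidence measures used in these steps can be made concrete with a minimal C++ sketch; the predicate strings and row layout are invented for the example and are not taken from [65] or [71]:

```cpp
#include <set>
#include <string>
#include <vector>

// Each row lists the predicates that hold for one spatial object,
// e.g. {"is_a(town)", "close_to(water)"}.
using PredicateSet = std::set<std::string>;

static bool containsAll(const PredicateSet& row, const PredicateSet& q) {
  for (const auto& p : q)
    if (row.count(p) == 0) return false;
  return true;
}

// support(X) = fraction of rows containing every predicate in X.
double support(const std::vector<PredicateSet>& rows, const PredicateSet& x) {
  if (rows.empty()) return 0.0;
  int hits = 0;
  for (const auto& r : rows)
    if (containsAll(r, x)) ++hits;
  return static_cast<double>(hits) / rows.size();
}

// confidence(X => Y) = support(X union Y) / support(X).
double confidence(const std::vector<PredicateSet>& rows,
                  const PredicateSet& x, const PredicateSet& y) {
  PredicateSet xy = x;
  xy.insert(y.begin(), y.end());
  double sx = support(rows, x);
  return sx == 0.0 ? 0.0 : support(rows, xy) / sx;
}
```

A rule X => Y is then reported as strong when `support` of X union Y meets the minimum support and `confidence` meets the chosen threshold.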

3.4 Image mining

The knowledge gap between much data and little information is widened further by the semantic gap between low-level image feature representations and high-level application concepts. The vital, and most challenging, difference between spatial and image data mining is that the latter involves identification and extraction of spatial features or objects from the pixels of an image through image processing techniques. Spatial features can be extracted from the image using techniques for feature extraction, segmentation, and image classification. Implicit relationships between the extracted objects are made explicit by providing semantics for identifying meaningful regions, for instance, identifying some geometric features from a series of images as deforestation patterns. Knowledge is discovered at a high level after associating these deforestation patterns with application-specific concepts, such as deforestation due to certain human activities; small and large farms constructed by farmers, and other deforestation mechanisms, result in deforested areas of different sizes and shapes. The application concepts are different classes of spatial objects, which are associated with a specific application domain. Silva and Camara [74] define image mining as associating the spatial patterns in the image domain with the application concepts in the application domain. In general, the four stages in image mining are:

• Pre-processing the image data for error corrections, analogous to data preparation in classical data mining.

• Feature extraction through image processing techniques, such as extracting buildings from a remote sensing image through segmentation.

• Associating extracted features with the application concepts, resulting in spatial configurations, for instance, residence buildings and factory buildings.

• Analyzing spatial configurations to obtain useful knowledge, for instance, that the factory area has decreased over the last 30 years.

Silva and Camara [74] performed image mining for Amazon deforestation patterns in a time-series image database in three phases:

• In the first phase, they recognized landscape objects for the application domain. A landscape object is an object defined as a deforested area. They selected corridor, diffuse, and geometric shape patterns of deforestation, which were recognized based on factors such as shape and spatial arrangements in Amazon deforestation.

• In the second phase, these landscape objects were identified and extracted from a set of sample images in order to build a reference set of spatial patterns. This reference set was obtained through segmentation of sample images under a cognitive assessment process, in which a human specialist associates landscape objects (from the image) with the spatial patterns typology (corridor, diffuse, geometric). Once the reference set of spatial patterns was built, the next phase was to use it to mine spatial configurations from the time-series of images in an image database.

• In the third phase, geometric structures emerging from clustering of an image were mapped to spatial patterns of deforested areas from training data. This mapping of a deforested area for a time-series of images t1, t2, t3, ..., tn was stored as a spatial configuration in the form of database records. A spatio-temporal pattern would be the set of all spatial configurations mapped as deforested area for an image time-series over a long period like 30 years. Using these spatial configurations stored as database records, analysis was performed, and answers to different questions were found, such as "Did the area of large farms increase during ten years?"

In our cloud patterns detection application scenario for database-oriented data mining techniques, vector polygons representing the boundary of the study area were used to identify the pixels within the polygons. Principal component analysis was then applied to the time-series of images, and the statistics over the output were stored in the database as records. A time-series pattern analysis was then performed over the attribute data.

In the following sections we discuss image mining at a low level, for feature extraction, and at a high(er) level, for providing semantics to the extracted features.

3.4.1 Low-level image analysis for feature extraction

At a low level, the important activities are the identification and extraction of objects from the pixels of a single image. A pattern can be considered a unique structure that can be extracted from an image and that describes a specific phenomenon such as a land cover type. In the context of remote sensing, a pattern is a spectral signature: a set of spectral radiances measured in different bands of a multispectral image. Pattern or feature extraction based on the pixels of a single image is a matter of image analysis, image processing, and recognition. Classification and segmentation are two widely used image analysis techniques for pattern extraction from images.

There are two approaches to pixel classification in remote sensing. One of them attempts to relate pixel groups to actual earth-surface cover types such as vegetation, soil, urban area, and water; these groups of pixels are called information classes. The other approach determines the characteristics of non-overlapping groups of pixels in terms of their spectral band values; the groups of pixels in this case are known as spectral classes [75]. The former approach, where samples from information classes (training data) are used for learning and then for classifying unknown pixels to find patterns, is called supervised classification. The latter approach, where at first the spectral classes are found without a priori knowledge of the information classes, and then their relationship with the information classes is established using a map or ground truth, is called unsupervised classification. Unsupervised techniques generally imply the use of statistical methods to decide on decision boundaries [76].

Image data quality is one of the concerns when dealing with the image at the low level. Many uncertainties arise at this level that limit land cover/land use classification during the image mining process. The pixel in remote sensing has its own inherent problems, such as the physical process of image creation and the mixed-pixel problem, that need to be considered before the image mining process.

Both classification and segmentation have supervised and unsupervised techniques in the image domain. Classification can be performed both at pixel level and at object level.

Pixel-level image analysis for pattern discovery

The image analysis at this level is based on characteristics of the single pixel. The pixel classification techniques used attempt to assign a class label to an individual pixel based on its spectral value or feature space. This space is a multidimensional space created from the different spectral bands of a remote sensing image. Unsupervised classification, which includes various clustering algorithms such as ISODATA and k-means, considers only spectral distance measures in a statistical analysis for spectral grouping. Supervised classification, such as the maximum likelihood classifier, incorporates prior knowledge along with the spectral distance measures. Rule-based classifiers such as neural networks and expert classifiers also rely on the spectral characteristics of the single pixel in defining rules.
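The spectral-grouping idea behind unsupervised classifiers can be illustrated with a deliberately minimal k-means sketch over single-band pixel values; production classifiers such as ISODATA operate over the full multi-band feature space with merging and splitting heuristics, and this is not TerraLib code:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal k-means over single-band pixel values: assign each pixel to the
// nearest cluster centre by spectral distance, recompute centres as the
// mean of their pixels, and repeat until the assignment is stable.
std::vector<int> kmeans1d(const std::vector<double>& pixels,
                          std::vector<double> centres, int maxIter = 100) {
  std::vector<int> label(pixels.size(), 0);
  for (int iter = 0; iter < maxIter; ++iter) {
    bool changed = false;
    // Assignment step: nearest centre wins.
    for (std::size_t i = 0; i < pixels.size(); ++i) {
      int best = 0;
      for (std::size_t c = 1; c < centres.size(); ++c)
        if (std::fabs(pixels[i] - centres[c]) <
            std::fabs(pixels[i] - centres[best]))
          best = static_cast<int>(c);
      if (best != label[i]) { label[i] = best; changed = true; }
    }
    // Update step: move each centre to the mean of its assigned pixels.
    std::vector<double> sum(centres.size(), 0.0);
    std::vector<int> cnt(centres.size(), 0);
    for (std::size_t i = 0; i < pixels.size(); ++i) {
      sum[label[i]] += pixels[i];
      ++cnt[label[i]];
    }
    for (std::size_t c = 0; c < centres.size(); ++c)
      if (cnt[c] > 0) centres[c] = sum[c] / cnt[c];
    if (!changed) break;
  }
  return label;
}
```

The resulting labels are spectral classes; relating them to information classes still requires a map or ground truth, as described above.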

Traditional statistical classifiers rely exclusively on the spectral characteristics, but thematic classes are often spectrally overlapping [72]. Mining activity in a spatial context extracts knowledge about:

• Patterns associated with land use classes and their evolution in space and time.

• Identification/extraction of objects and their implicit relationships in terms of dependencies.

It is quite difficult to extract objects using individual pixel values without considering the context, i.e., a pixel and its relationships to surrounding pixels [73].

Object-level image analysis for pattern discovery

The image analysis at object level does not attempt to identify a single pixel, but rather a pixel and its relationships to surrounding pixels, to formulate an object or segment. Pixels have relations to surrounding pixels that provide the basis for grouping them into objects. Object- or segment-based classification approaches try to combine neighbouring pixels with similar properties into image segments. The segments have spatial characteristics like shape, size, texture, colour, etc., spectral characteristics, as well as spatial relationships. After segmentation, this information over the segments is used for classifying the objects. Once an image is transformed into objects having crisp boundaries (i.e., polygons), approaches in spatial data mining can also be directly applied to image mining. Therefore, segmentation is normally performed in any image mining activity, and a platform that provides support for image mining must have at least one segmentation algorithm.

Image segmentation methods are either knowledge-driven or data-driven. Knowledge-driven (top-down) methods are of the supervised type and apply prior knowledge in the segmentation process. Data-driven (bottom-up) segmentation algorithms are based either on discontinuity or on similarity of the intensity values in an image. Discontinuity-based approaches partition the image using abrupt changes in intensity, as in edge detection. Similarity-based approaches partition the image into regions of similar intensity. Thresholding, the wavelet transform, mathematical morphology, fuzzy clustering, region growing, and region splitting and merging are examples of similarity-based segmentation methods. Recent surveys indicate that region-growing approaches are well suited for producing closed and homogeneous regions [74]. TerraLib provides a region-growing segmentation algorithm.
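The similarity-based, region-growing idea can be sketched as follows; this is a flood-fill-style illustration with a fixed intensity threshold, not TerraLib's implementation, and the image is assumed to be a single-band grid in row-major order:

```cpp
#include <cmath>
#include <vector>

// Minimal similarity-based region growing: starting from each unvisited
// pixel (the seed), grow a region over 4-connected neighbours whose
// intensity differs from the seed by at most `threshold`. Returns one
// region label per pixel.
std::vector<int> regionGrow(const std::vector<double>& img,
                            int w, int h, double threshold) {
  std::vector<int> label(img.size(), -1);  // -1 = not yet assigned
  int next = 0;
  for (int start = 0; start < w * h; ++start) {
    if (label[start] != -1) continue;
    double seed = img[start];
    std::vector<int> stack{start};
    label[start] = next;
    while (!stack.empty()) {
      int p = stack.back(); stack.pop_back();
      int x = p % w, y = p / w;
      const int nbr[4][2] = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
      for (auto& d : nbr) {
        int nx = x + d[0], ny = y + d[1];
        if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
        int q = ny * w + nx;
        if (label[q] == -1 && std::fabs(img[q] - seed) <= threshold) {
          label[q] = next;
          stack.push_back(q);
        }
      }
    }
    ++next;  // start a new region for the next unvisited pixel
  }
  return label;
}
```

Real region-growing segmenters compare against region statistics rather than a fixed seed, and merge regions under a scale parameter; the sketch only shows the growth mechanism.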

While transforming an image into objects, the quality of image analysis depends on the relation between scale (abstraction) and spatial resolution. Spatial resolution is the area on the ground covered by a single pixel, whereas scale is the magnitude or level of abstraction at which a certain phenomenon can be described. Merging of objects during segmentation is controlled by a scale parameter, and the same objects appear differently at different scales. The choice of an appropriate scale is made by domain experts and is directly related to the image semantics.

Once the objects are correctly identified during the segmentation, the next step is to classify them: objects are assigned to certain classes based on class descriptions. A class description can be based on spectral information such as image bands, and on spatial information such as texture and colour over the segments of a segmented image. Classifiers that take into account spectral, spatial, and structural information while distinguishing different segments are called contextual or structural classifiers. Contextual classifiers try to simulate the behaviour of a photo-interpreter in recognizing homogeneous areas in the image, based on the spectral and spatial properties of the images; spatial contextual classifiers consider spatial relationships and dependencies between objects. The ISOSEG classifier is available in TerraLib to classify segments or regions of a segmented image. It is a non-supervised grouping algorithm applied to a set of regions, which are characterized by their statistical attributes of mean and covariance matrix, and also by area [78].

Another important activity performed after transforming an image into objects through segmentation is to link the objects to form a topological network, in which each object knows its neighbours, sub-objects, and super-objects. Silva and Camara [16] propose a graph mining approach for a topological network of the objects that uses segmentation by region growing, followed by hierarchical region organization using a graph model. Each region of lower scale is contained in a region of higher scale, generating hierarchical region adjacency graphs (hRAGs), representing each region as a vertex with attributes, and edges describing the topological scale of regions. The topological network and region hierarchy form the framework of a knowledge base that defines image semantics. For example, "Water" close to "Building" identifies the water body as a lake located in some urban area. In this way, implicit relationships between the objects are made explicit.

3.4.2 High-level knowledge discovery

At the high level, the major activities are:

• To provide application-specific semantic concepts for the objects extracted at the low level.

• To relate these semantically annotated objects over the time series of images from the image database for knowledge discovery.

Other GIS data, such as vector and non-spatial data from a spatial database, is also employed; therefore, the classifiers at this stage must be able to utilize many different sources of data over time.

Some structural classifiers associate spatial patterns discovered from the image with high-level application concepts for knowledge discovery. The spatial patterns are features or objects extracted from an image, and the application concepts can be any attributes that can categorize those patterns. The most frequently used structural classifiers are machine learning (ML) classifiers such as C4.5 and CART, and Bayesian contextual classifiers.

3.4.3 Image mining from integrated image/GIS data analysis

Rogan and Miller [81] summarized four ways in which GIS and remote sensing data can be integrated:

1. GIS can be used to manage multiple data types.

2. GIS analysis and processing methods can be used for manipulation and analysis of remotely sensed data (e.g., neighbourhood and reclassification operations).

3. Remotely sensed data can be manipulated to derive GIS data.

4. GIS data can be used to guide image analysis to extract more complete and accurate information from spectral data.

Image mining, which derives knowledge and patterns through image processing and analysis, is the potential area where GIS data can be used to guide the overall process. This technique was adopted in the development of our cloud patterns detection image mining application, where vector data was used to identify the study area in the image and so guide the image mining process.


3.5 Summary

Data mining tools and techniques work on an attribute space or vector space of n dimensions that is explicit, whereas spatial predicates are often implicit in terms of topological relations such as overlap, within, etc. These are made explicit through identifying objects, defining ontologies, and measuring spatial autocorrelation. Image mining introduces a further step of image processing to identify objects from the pixels of an image before making them explicit.

In this chapter, the methods, tools, and techniques involved in mining technology were reviewed and selected for the cloud detection image mining application. These methods are referenced in Chapter 5 in developing the image mining scenario applications using the TerraLib library and PG DBMS technology.


Chapter 4

A database application development method using TerraLib

4.1 Introduction

The TerraLib GIS software library (TL) was adopted as a method to provide image support in the PostgreSQL/PostGIS database and to develop an image mining application based on integrated image/vector data analysis. This chapter explains the classes, conceptual model, and remote sensing image handling techniques provided by the TL library that were involved in the development of an image mining database application. Section 4.2 describes the set-up that has been built for image mining, using the TL library and the PostgreSQL (PG) database technology. Section 4.3 explains various design patterns adopted in the development of the TL GIS library; these design patterns were identified during an in-depth study of the TL library. Section 4.4 explains the conceptual model for vector and image data handling for storage, analysis, and visualization. Section 4.5 discusses various image storage options with illustrations and potential applications; it also describes the image data handling and manipulation options adopted for our cloud patterns detection image mining application.

4.2 TerraLib, TerraView and PostgreSQL/PostGIS set-up for image mining

A set-up for the cloud patterns detection image mining application, based on image and GIS data in a spatial database, was built as shown in Figure 4.1. Image and vector data were stored in a PostgreSQL/PostGIS (PG/PG) database. The PostgreSQL DBMS and the PostGIS spatial extension were installed on the Linux Debian operating system. The TerraLib library (TL) was used for algorithm development in our mining application. The visualization interface TerraView (TV) was used for analyzing results generated by the TL application. Both the TL and the TV were installed on the same Linux database server, extending the PostgreSQL DBMS. The whole set-up can be accessed by an application developer through the ITC local area network (LAN).

Figure 4.1: A set-up for cloud detection image mining application

4.2.1 TerraLib dependencies on open source third party libraries

The TL library code relies on a number of open source software packages that are used to support some of its kernel functions and to provide image support in PG. These third-party software packages are usually provided with the TL library; however, a TL application developer should be familiar with them while developing and compiling the TL library or application code that uses it. The application developer needs to consider compatibility issues between the TL library and the versions of its dependent software packages. A list of the third-party software packages used by TL is presented in Table 4.1.

4.3 TerraLib application development

The TL library follows a generic programming paradigm for developing reusable libraries, through the object-oriented programming language C++. Generic programming focuses on finding commonality among similar implementations of an algorithm and providing suitable abstractions, so that a single generic algorithm can realize many concrete implementations. Such a programming style is extremely important, for instance, in GIS, where an operator needs to handle various data sets and procedures to perform a single task. For example, various procedures to measure spatial autocorrelation can be adopted for a set of points, a set of polygons, a TIN, a grid, or a remote sensing image. Ideally, a spatial autocorrelation algorithm should be independent of the data structures on which it operates and vice versa. However, many GIS applications provide a large number of monolithic functions, roughly the number of data structures times the number of algorithms.

Table 4.1: Third-party libraries used by TerraLib for image support in PG

zlib: To compress/uncompress image data when storing it in and retrieving it from a TerraLib database.
libjpeg: A JPEG image compression library, used in two contexts: to compress raster data before storing it in a TerraLib database, and to decode/encode image data in JPEG format.
tiff: To decode/encode raster data in TIFF/GeoTIFF format.
shapelib: To decode/encode vector data in shapefile format.
libltidsdk.a: Used for decoding/encoding image data in MrSID format.
libpq.a: The PG client-side development package; provides the environment for database application development. A front-end application handles BLOBs in the PG database through this client-side package. A normal PG DBMS server installation does not install this client-side development package, and the developer needs to install a version-specific package to connect to the PG server.
GCC compiler: Used to compile the TL application code on Linux Debian. Compatibility between the GCC compiler version and the TL library version needs to be considered.

Generic data structures such as list, set, and map, and generic algorithms such as sort and search, are provided in the generic Standard Template Library. The algorithms are completely separated from the data structures. A generalized iterator is provided to connect an algorithm to the data structure; it traverses through the data structure for selection and manipulation.
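This separation is visible in the STL itself: the same generic algorithm runs unchanged over different containers through their iterators. A small illustration:

```cpp
#include <algorithm>
#include <list>
#include <vector>

// One generic function works over any container exposing begin()/end():
// std::find knows nothing about the container, and the containers know
// nothing about the algorithm; iterators connect the two.
template <typename Container>
bool containsValue(const Container& c, int value) {
  return std::find(c.begin(), c.end(), value) != c.end();
}
```

The same `containsValue` can be called with a `std::vector<int>`, a `std::list<int>`, or any other iterable container of `int`, which is exactly the decoupling of algorithms from data structures described above.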

The TL library applies the generic programming paradigm to the development of a generic GIS library in a four-step process [31]:

• Finding regularities in the spatial data handling algorithms.

• Generalizing the regularities in these algorithms to provide requirements for traversal over data structures.

• Providing iterators based on the requirements for traversal over data structures.

• Designing algorithms that use provided iterators.

A class in object-orientation is a template for creating an object as an instance of that class. A class has variables to encapsulate the data and state of an object, and methods to encapsulate the behaviour of an object. A class interacts with other classes through the variables and methods that are provided to define an interface for that class. Other classes implement (or realize) the interface of a class by providing structure (i.e., data and state) and concrete method implementations (i.e., code that specifies how the methods work).

A design pattern is a description or template for solving a geo-computational problem that can be adopted in developing a GIS library. Various design patterns have been adopted for generic programming in developing the TL library, which follows the design patterns described by Gamma and Helm [80]. It is important to understand the design patterns adopted by the TL classes, as well as the collaboration between different classes in a particular design pattern, before using them in TL application development. As discussed below, various design patterns adopted by the TL library classes were identified; the generic diagrams for the design patterns provided by Gamma and Helm were modified for the TL classes.

1. Template Class

A template class is a feature of the C++ programming language that allows classes to operate on generic types: the type a class works on is supplied as a parameter when the template is instantiated. This allows a class to work on many different data types without being re-written for each one. The best example is the Standard Template Library, a fundamental part of C++ whose aim is to generalize data structures such as list and vector, and iterators, so that they can be used with any data type or algorithm.

A similar approach has been adopted by the TL library to define, for instance, a generic iterator that can traverse a data structure, such as a generic image data structure, independent of format. The TL library also uses class templates to define subtypes of generic containers. For instance, the TeSingle class template is defined in the TeComposite class for a parameterised implementation to handle an object built of a single atomic element, such as a line object used to construct a polygon.
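The container idea can be sketched with a simplified composite template in the spirit of TeComposite; this is an illustrative stand-in, not the actual TerraLib definition, and `Point` is a hypothetical element type:

```cpp
#include <cstddef>
#include <vector>

// A simplified composite container: it stores elements of any type T and
// exposes begin()/end() iterators so that generic algorithms can traverse
// it regardless of the element type.
template <typename T>
class Composite {
 public:
  void add(const T& e) { elems_.push_back(e); }
  std::size_t size() const { return elems_.size(); }
  typename std::vector<T>::const_iterator begin() const {
    return elems_.begin();
  }
  typename std::vector<T>::const_iterator end() const {
    return elems_.end();
  }
 private:
  std::vector<T> elems_;
};

// The same template works for points, lines, polygons, and so on; here a
// hypothetical stand-in element type:
struct Point { double x, y; };
```

One template, written once, then serves `Composite<Point>`, `Composite<Line>`, etc., which is the reuse the paragraph describes.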

2. Template Method

Function overloading is a concept of object-orientation in which the code for a function is repeated, with only subtle changes, for different parameters varying in data type. In the generic programming approach, we define a template method's functionality so that it adapts to the data type(s) of its parameter(s), removing the need to repeat common code. A template method further attempts to generalize the behaviour of a function or algorithm through function overriding. Function overriding is a concept of object-orientation in which a new version of a method is introduced through extending a class, with only subtle changes in the behaviour of the method in the parent class. Gamma and Helm [80] define a template method as: "Define the skeleton of an algorithm in an operation, deferring some steps to subclasses. Template method lets subclasses redefine certain steps of an algorithm without changing the algorithm's structure."


In this way the notion of a generic template in generic programming is combined with object-orientation (inheritance) to define reusable design patterns in the TL library. In this library, a design pattern, as a template, is defined after analyzing a geo-computational problem and is implemented in two steps:

• A base class is a realization of an algorithm as a generic template; the invariant parts of the algorithm are implemented in the base class.

• A specialised class is defined by extending the base class and implements the behaviours that can vary.
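The two steps above can be sketched as follows; the class and method names (`RasterImporter`, `decode`) are hypothetical examples, not TL classes:

```cpp
#include <string>

// Template method sketch: the base class fixes the invariant skeleton of a
// (hypothetical) raster import algorithm; subclasses redefine only the
// varying decode step, without touching the skeleton.
class RasterImporter {
 public:
  virtual ~RasterImporter() = default;
  // The invariant skeleton: open, decode (varies), close.
  std::string importFile() { return "open;" + decode() + ";close"; }

 protected:
  virtual std::string decode() = 0;  // step deferred to subclasses
};

class TiffImporter : public RasterImporter {
 protected:
  std::string decode() override { return "decode-tiff"; }
};

class JpegImporter : public RasterImporter {
 protected:
  std::string decode() override { return "decode-jpeg"; }
};
```

Each subclass changes one step; the algorithm's structure in `importFile` stays fixed, exactly as in the Gamma and Helm definition quoted above.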

3. Singleton Design Pattern

The singleton design pattern ensures that only one instance of a class can ever be created within an application programme; that single instance is then used by all operations within the application. Gamma and Helm [80] define the singleton design pattern as: "Ensure a class has only one instance, and provide a global point of access to it."

The example class diagram in Figure 4.2 shows a unique precision being set globally for all geometries in an application. That unique precision for all geometric operations can be set at the entry point of an application. For example, in this set-precision call,

TePrecision::instance().setPrecision(TeGetPrecision(region->Projection()))

a precision derived from the region geometry is set globally for all forthcoming geometric operations.
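A minimal singleton in the style of the TePrecision example can be sketched as below; this is a simplified illustration, not the TerraLib implementation:

```cpp
// Minimal singleton: a single global precision value with one access
// point. The constructor is private and copying is disabled, so no second
// instance can ever be created.
class Precision {
 public:
  static Precision& instance() {
    static Precision inst;  // created once, on first use
    return inst;
  }
  void setPrecision(double p) { precision_ = p; }
  double precision() const { return precision_; }

 private:
  Precision() : precision_(1e-9) {}  // default precision is an assumption
  Precision(const Precision&) = delete;
  Precision& operator=(const Precision&) = delete;
  double precision_;
};
```

Any part of the application that calls `Precision::instance()` receives the same object, which is what lets the precision set at the entry point govern all later geometric operations.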

4. Factory Design Pattern

A factory method is designed as an interface to create different objects of the same category. A subclass then implements that interface and overrides the factory method to create an object of a particular category at run-time; the factory method thus lets a class defer instantiation to its subclasses. Gamma and Helm [80] define the factory design pattern as: "Provide an interface for creating families of related or dependent objects without specifying their concrete classes and let subclasses decide which class to instantiate."

The factory design pattern is adopted by TerraLib to define many GIS classes and algorithms. For demonstration, a factory design pattern for handling projections in TerraLib is shown in Figure 4.3. As shown, there are four actor classes and interfaces in a factory design pattern: the product, the creator, the concrete product, and the concrete creator. These are explained in the context of a projection factory as follows:


Figure 4.2: Singleton design pattern adopted for TerraLib

Figure 4.3: Factory design pattern adopted for TerraLib


• Product The product interface in the factory design pattern defines an interface for creating objects by the factory method. In this example, the product is an interface for creating a projection object at run-time.

• Creator The creator class in the factory design pattern declares the factory method, which returns an object of type product. In this example, the TeProjection class is a creator class that can deliver multiple projections as products to the client. The factory method in the creator class is provided to define a high-level map projection definition and the geo-referencing of a satellite image.

• Concrete Creator The concrete creator of the factory design pattern extends the creator class by overriding its factory method with a specific implementation. In this example, the concrete creator TeUTM extends the TeProjection super class to override its factory method with the UTM projection-specific details. The creator class realizes any of its extending classes through the following technique: the TeFactory template class is defined with a list to realize an individual family of factories, for instance, the factory of projections, the factory of decoders, and the factory of databases. TeProjection is the abstract base class for the factory of projections. The list of all concrete creators extending TeProjection (for instance, the TeUTM concrete creator) is passed to TeFactory to realize the factory of projections at run-time. This list of concrete creators is initialized at run-time through a static member in each concrete creator, as all static members are initialized at the start of a program. The advantage is that any new concrete projection class can be added by extending the projection class, without recompiling existing classes or explicitly registering with the kernel.

• Concrete Product The concrete product of the factory design pattern is the product returned by its concrete creator. In this example, the concrete creator TeUTM implements the product interface to return an instance of the appropriate concrete product, the UTM projection.

5. Strategy Design Pattern

In spatial data handling, there are normally different ways to perform the same function; for instance, there is a range of algorithms to measure spatial autocorrelation for a set of points, a set of polygons, a TIN, a grid, or an image. The strategy design pattern generalizes such algorithms so that a particular algorithm can be selected implicitly at run-time depending on context. Gamma and Helm [80] define the strategy design pattern as “Define a family of algorithms, encapsulate each one, and make them interchangeable.”

The strategy design pattern is adopted by TerraLib to develop many GIS classes and algorithms. For instance, a large number of algorithms exist for encoding the various raster formats. A strategy design pattern generalizing these algorithms for the various raster formats is shown in Figure 4.4. As shown, there are three actors in a strategy design pattern: the context, the strategy, and the concrete strategy. These three actors, in the framework of the TerraLib implementation, handle all encoding algorithms for the different raster formats at run-time. Their roles are as follows:

Figure 4.4: Strategy design pattern adopted for TerraLib

• Context The TeRaster class acts as the context. A context class maintains a reference to a strategy object and calls the concrete strategy through the strategy interface. Any request for a particular raster format decoding algorithm is made through the TeRaster context class.

• Strategy The abstract class TeDecoder acts as the strategy, declaring an interface common to all supported encoding algorithms for raster formats. The TeRaster class uses this decoder interface to call an algorithm defined by a concrete strategy.

• Concrete Strategy A concrete strategy implements an algorithm behind the strategy interface. The class TeDecoderTIFF acts as a concrete strategy that implements the TIFF decoding algorithm through the strategy interface TeDecoder. A concrete decoder such as TeDecoderTIFF knows how to return a value (as a double) for each pixel of a particular raster format such as TIFF. Concrete decoder implementations vary with the raster format. A concrete decoder for a particular raster format can be developed by extending the TeDecoder strategy class without recompiling existing classes.

The context class TeRaster can perform raster operations without knowing the low-level details of each raster format. The client requests the context class to create raster geometry for a particular image format. The context class then passes all required arguments, such as the string identifier “.tiff”, to the strategy interface. The strategy interface gets an instance of TeDecoderFactory, which implements the factory design pattern to select an appropriate concrete decoder according to the string identifier. The concrete decoder TeDecoderTIFF returns a value (as a double) for each pixel of the TIFF raster format. The values returned by any concrete decoder are populated with only one instance of the TeCoord2D geometry class.

Figure 4.5: Iterator design pattern adopted for TerraLib

6. Iterator Design Pattern

The iterator design pattern generalizes iterators for data structure traversal by encapsulating their internal differences. An iterator is a generalized pointer that can point to all objects, of varying type, in a container. Gamma and Helm [80] define the iterator design pattern as “Provide a way to access the elements of an aggregate object sequentially without exposing its underlying representation.”

The C++ Standard Template Library (STL) defines generalized iterators over container types such as vector, tree, and array, used by different algorithms such as search and sort. Instead of developing algorithms for specific container types, they are developed for a generalized iterator category applicable to these various container types.

TerraLib geometries adopt the iterator design pattern to provide a generic iterator that external algorithms use to traverse the internal structures. The iterator design pattern adopted for the TL library is shown in Figure 4.5. As shown, there are four actors in an iterator design pattern: the iterator, the aggregate, the concrete iterator, and the concrete aggregate. These four actors, in the context of the TerraLib implementation of an iterator over raster geometry, are explained as follows:

• Aggregate An aggregate defines an interface for creating an iterator object. The TeRaster class acts as an aggregate, providing an interface for creating a generalized iterator object at run-time for an algorithm to traverse a raster format.

• Iterator An iterator defines an interface for accessing and traversing raster elements. The TeRaster::Iterator is an iterator interface that allows traversal over the raster elements (pixels) in a similar way as STL iterators traverse data structures such as a list. The iterator can iterate over raster elements independent of any raster format or any algorithm.

• Concrete Iterator The TeRaster::iteratorPoly class acts as a concrete iterator that implements the iterator interface with an area restriction, covering only the raster elements (pixels) that lie inside (or outside) a specific region (polygon).

• Concrete Aggregate A concrete aggregate implements the iterator-creation interface to return an instance of the proper concrete iterator. The TeRaster class also acts as a concrete aggregate, providing begin and end methods for creating an iterator object such as iteratorPoly.

7. Bridge Design Pattern

The bridge design pattern decouples an abstraction from its implementation so that the two can vary independently of each other. For example, TeGeomComposit is a template class provided as an abstraction for handling a hierarchy of geometries in the TL library. This class is used for instantiating the different composite geometries. The TePolygonSet class extends TeGeomComposit for storing multiple polygons; similarly, TeLineSet extends TeGeomComposit for storing multiple lines.

Another advantage is efficient memory management, as an instance of the abstraction class maintains a reference to an object of the implementor type. Any new object points to the same memory area as the first instantiated object of a similar type. For example, multiple copies of a polygon in an instance of TePolygonSet are allowed to share the same memory area by adopting the bridge design pattern.

4.4 Conceptual data model

The conceptual model for TerraLib is built in the database when the TeDatabase class is initialized in memory. This initialization requires an instance of a specific driver class for a DBMS, such as TePostGIS for PG/PG. The data model creates spatio-temporal data structures for storage and visualization. The TL library distinguishes between data sources, for the storage of spatial data in the database, and data targets, for the visualization of spatial objects in TerraView. The spatio-temporal data structures for data sources and data targets are maintained by the TL kernel in the database and form the core of the TL library. The kernel includes classes for the storage, retrieval, maintenance, manipulation, and visualization of spatio-temporal objects in the PostGIS DBMS. On top of the kernel structures, the TerraLib functions for spatio-temporal analysis and image processing are built. These functions are then called from various interfaces in TL application development. The architecture of the open source TerraLib software is shown in Figure 4.6.

Figure 4.6: TerraLib software architecture [79]

4.4.1 Data model for storage

The abstractions database, layer, representation, and projection are related to the source domain, i.e., the storage organization and hierarchy of spatio-temporal objects in the spatial database, as shown in Figure 4.7. These abstractions for the source domain are explained as follows:

• Database The te_database table describes metadata and layers of spatial data in the database.

• Layer The te_layer table stores layer information in the database. A layer is a container of spatial objects that share a set of attributes. A layer is represented in memory as an object of the TeLayer class and in the TL database as a record of the table te_layer. A layer can hold vector data with point, line, and polygon objects, or raster data such as elevation data or an image. A layer is created in the database when a vector data file in an interchange format such as shapefile or MID/MIF, or raster data such as a TIFF image, is imported into the database. Layers can also be created by processing other layers in the database; for example, in our cloud-pattern detection image mining application in the next chapter, the overlay intersection of the Netherlands boundary data layer and an image layer is stored as a separate layer in the database.

Figure 4.7: Conceptual data model related to source domain for image and vector data storage in PG, modified from [29]

• Representation The te_representation database table manages the representations of all geographical objects, such as point, line, polygon, and raster, in a layer. Each representation of a layer has a table inside the database whose name is the geometry name appended with the layer id, as shown in Figure 4.7. For example, if a vector shapefile with layer id 152 has the three geometries line, polygon, and point, TerraLib will create three individual representation records in the te_representation table and three corresponding tables polygon_152, line_152, and point_152 in the database. Each representation table has a geometry field spatial_data whose “GEOMETRY” type represents the spatial type provided by the spatial extension PostGIS.

The TL library also supports multiple representations of a geometry. For example, a district can be represented by a polygon or a point, depending on the scale; a centroid of the polygon is calculated when a point representation is required. The library also supports complex representations such as cell spaces and networks. For the raster type, TerraLib supports multi-dimensional regular grids.

• Projection As mentioned above, each representation table of a layer has a geometry field spatial_data whose “GEOMETRY” type represents a spatial type provided by the spatial extension PostGIS. The representation table name, along with the geometry column name, is recorded in the geometry_columns PG metadata table to assign a PG data type to that TL representation. However, the TL geometries do not use the projections provided by PG, leaving the SRID column value “-1”; the TL library uses its own cartographic projections defined in the kernel. A projection is represented in memory as an object of the TeProjection class and in the database as a record of the table te_projection. Metadata for all attribute tables associated with a layer are recorded in the te_layer table.

For raster data such as satellite imagery, no complex data type is provided by the spatial extension PostGIS; therefore, BLOBs are used for the actual image data storage and retrieval in the database.

4.4.2 Data model for visualization

The abstractions view, theme, visual, and legend are related to the target domain for the visualization of spatio-temporal objects. These visualization abstractions describe data retrieval and presentation for front-end applications such as TerraView. These abstractions for the target domain are:


• View Just as the database organizes layers of spatial data, a view organizes one or more themes containing spatial objects. A theme in a view represents a layer in the database. A view aggregates all layers that should be presented and handled simultaneously.

• Theme A layer in the database is added as a theme in a view when it is required as input for some GIS analysis or other functionality provided by TerraView. A TL query retrieves tuples from layers, converts these tuples into a set of objects, and groups the objects in a theme.

The operator can add any number of themes to a view and select some or all of them as required to perform a function. A selected theme is called a visible theme. A view can have any number of visible themes and one active theme. The active theme is the theme whose descriptive attributes are shown in the grid area in TerraView. All visible themes in a view except the active theme are drawn from bottom to top; the active theme is drawn last.

Themes are also used to make selections over the geographical objects of layers. A restriction can be defined on conventional attributes, or on spatial or temporal properties of the objects. A spatial query is based on spatial relations, such as within and touches, among the object geometries of one or two themes.

• Visual and Legend The visual and legend represent the sets of attributes for grouping and for the visual representation of the data.

All these visualization abstractions create tables in the database and objects in memory.

The TL library provides a visualization interface, TerraView, for image and vector data and for spatial analysis based on such data. However, the TL library also offers other options to develop a spatial analysis application on top of the TL library and provide a visualization interface for that application:

• An application developer can develop a visualization interface around his prototype rather than using the TerraView visualization interface. In this case, the TeApplication class provides a simple interface for the visualization of GIS data. It provides the run() and show() methods to construct the user interface for displaying data, which may be a simple geometrical structure or a more complex layer.

• The second option is to build a GIS application on top of the TL library and add this application to TerraView as a plug-in to provide a visualization interface. The added plug-in works seamlessly with the other functions provided by TerraView.

Our pattern detection image mining application in Chapter 5 was developed using the TL library. The results are analysed through the TerraView visualization interface. The application will be added as a plug-in to TerraView so that it behaves like the other functionality provided by TerraView.


4.4.3 Image data handling in TerraLib Database

There are three main TerraLib classes to handle raster data inside a PG database:

• The TeRaster class, a generic raster data structure. An instance of thisclass is obtained for generic raster operations.

• The TeRasterParams class to set or get raster parameters such as tilingtype and block size.

• The TeDecoder class, the parent class of all decoders handling the various image formats such as TIFF and JPEG. An abstraction class for each image format is defined by extending the TeDecoder class; for instance, the TeDecoderTIFF decoder class for the TIFF image format.

A time-series of images can be imported into the database for a geographic location to build a land use/land cover change detection application, for instance of urban growth or flooding. Each new image in the time-series creates a separate layer that can be further overlayed with existing raster and vector layers in the database.

When a single image is imported, a record is inserted as metadata in each of the te_layer and te_representation tables. Further, three tables are created for storing metadata and data at each image import. This means that if there are 100 images to import, 300 tables will be created in the database. The database tables in this case are almost static, so there will not be a performance issue, since a PostgreSQL database can handle an unlimited number of tables and an unlimited database size. As the schema will constantly grow in this case, the database administrator must adopt some repository management measures. Alternatively, images can be imported as bands of a single raster layer; in this case the number of tables will not grow, but the tables themselves grow when a new image is inserted as a raster band. This can be useful if an algorithm needs to process pixels from many images of the same geographical area through a precise overlay of all image layers.

For our cloud-pattern detection image mining application, an application program was developed that takes the directory path of the images on the file system as input and creates a time-series database, inserting each image as a separate layer.

Another option is to import images for a whole geographic area, such as a country or a continent. An image mosaic is used to import all images of a larger area into a single raster layer in the database. An image mosaic creates only three tables inside the database for any number of images imported into the layer. The tables in this case are dynamic and grow at each image import into the mosaic layer, so the database administrator must adopt some performance measures. TerraAmazon is a real-time Amazon deforestation monitoring system developed by INPE using open source TerraLib and PostgreSQL. A raster layer for the whole Amazon region has been created for the application; the PostgreSQL database stores approximately 2 million complex polygons, and 20 gigabytes of full-resolution satellite images are added every year, using the TerraLib pyramidal resolution schema [82].


For our cloud-pattern detection image mining application scenario, in which image analysis is guided by vector data, all images needed to be stored as separate layers. The search algorithm clipped the image pixels for the study area from each image with the vector data of that area. This was achieved by overlaying each image layer with the vector layer using the same projection system. Image processing algorithms were then applied to the clipped pixels in the next step.

An image can be queried either through a direct SQL statement on the image data tables or through the set and get methods of an iterator provided by the TL library over an image data structure. In the former case, a developer can access an image at block or tile level, the size of which is decided when the image is inserted; spatial indexes will be used by the query in this case. In the latter case, a developer has finer-grained access at the atomic level, and a single pixel can be locked and manipulated explicitly. The generic iterator traverses the image while setting or getting pixel values for an algorithm, independent of the image format.

The image query criteria can be pixel values, attributes, or tiles of an image. Another option is to clip an object from the image layer with a vector layer through an overlay operation; the query in this case returns the pixels of the clipped object. The vector data can either be already stored in the database or result from spatial predicates such as touches and within, or from set operations such as union, difference, intersection, and overlay on geographical features. For our cloud-pattern detection image mining application, the image pixel values for the study area were queried through the overlay of the vector and image data layers already stored in the database.

The raster data resulting from a query can be directly sent as input to an image processing or statistical algorithm. The output of an algorithm can be handled in memory:

• To provide as input to another algorithm.

• To store in the form of an image on the disk or in the database.

• To store as a database record.

Any of these outputs can be further accessed for analysis in an image mining process or provided to front-end visualization applications. The output of an image mining algorithm can also be vector data, such as vector polygons created as a result of raster-to-vector conversion or region-growing segmentation.

For our cloud-pattern detection image mining application, the image resulting from the vector/image overlay operation was provided as input to the principal component analysis algorithm in the cloud detection process. The output of the principal component analysis algorithm was stored as PC images in the database. The PC images were then accessed from the TerraView interface for analysis. In another application scenario, the PC images output by the PCA algorithm were not stored in the database but handled on-the-fly as input to another algorithm for statistical results; these results were then stored in the PG database as records.


When dealing with individual pixels, it is important to consider efficiency issues. For computational efficiency, TerraLib implements a block cache mechanism to reduce the computational cost of fetch operations from the database. The tiling and multi-resolution scheme provides spatial indexing. The unique identifier associated with each tile is used as a primary key for queries and for the cache system. The cache system uses least-recently-used queues in memory for cache management. The maximum number of tiles in the cache can be set by an application developer according to the application requirements; the oldest tiles are aged out by the system when the cache is full.

Most of the time, the whole image is not required by the visualization interface, either because of display size limitations or to provide a zooming effect from coarser to finer resolution. For computational efficiency, TerraLib allows pre-built, down-sampled versions of the data to be stored during image import; this is called a multi-resolution pyramid. When the user requests the display of the data, TerraView selects the pyramid level whose resolution is closest to the resolution defined by the display area.

The web is another important output interface for the dissemination and visualization of geographic data. Open source TerraPHP is an extension of PHP built with the TL GIS library. TerraLib code can be embedded inside PHP scripts to facilitate the development of web applications for image visualization using pyramidal resolution, queries on geographic databases, and Web Map Service (WMS) access to the image data. The raster data from the database is delivered to the WMS in PNG or JPEG format.

4.5 Summary

Integrated GIS and remote sensing analysis requires that the analysis procedures be independent of data structures. Most current GIS packages follow separate procedures for different data structures. The TerraLib approach of generic programming makes spatial analysis independent of data structures; the design patterns provide the basis of such generic programming. The conceptual schema provided by TerraLib allows querying both image and vector data to perform overlay operations for integrated analysis on these data types. Further, the visualization interface provided by TerraLib gives the facility to analyse the results. The TerraLib approach to integrated data analysis was studied in detail and is applied to build the image mining application scenarios in the next chapter.


Chapter 5

Application scenarios for image mining: Results and Discussions

5.1 Introduction

Chapter 3 discusses various methods for data mining. These methods were further reviewed for extended spatial and image concepts, for spatial and image data mining respectively. In Chapter 4, we built and investigated a research platform using the TerraLib and PostgreSQL DBMS technologies as a method for developing advanced database applications with integrated image/vector analysis. In this chapter, image mining was selected as an advanced application to investigate the proposed method. Various statistical and database-oriented data mining techniques were applied to develop image mining application scenarios, ranging from low complexity to relatively more complex. By executing these scenarios, we also examined the TerraLib implementation as an image data-handling scheme in a generic object-relational database system. Different approaches to handling image and vector data for manipulation and knowledge extraction from a large spatial database were also investigated.

Section 5.2 explores the technique in which GIS vector data stored in the database is used to guide an image processing process for knowledge discovery from a time-series image database. A statistical data mining algorithm based on principal component analysis (PCA) was applied for cloud-pattern detection on a time-series image database of the study area. Various advantages and performance issues of the TerraLib approach are also discussed. Section 5.3 presents the method by which the TerraView (TV) visualization interface was extended to support temporal queries on time-series image data stored in the database. Section 5.4 presents a more complex image mining application scenario, developed with the output of many image processing algorithms in sequence; the results were transformed to an attribute space in the database so that database-oriented approaches to data mining could be applied.


5.2 Image mining guided by GIS data

5.2.1 Introduction

One of the important facilities that a DBMS provides for integrated RS and GIS analysis is the storage of remote sensing data as a layer just like any other GIS layer. A remote sensing image layer can then be overlayed with GIS data covering the same geographical area, provided that both layers share the same projection; if the layers do not have the same projection, the overlay operation performs an on-the-fly transformation. The GIS layers can be used to guide remote sensing image analysis algorithms. For instance, remote sensing imagery normally covers a large area on the ground, while in many cases the researcher needs the spectral characteristics of a particular area. It is time-consuming to select such regions of interest in an image during image analysis, particularly when executing an image mining process on large repositories of image data. GIS data can be used to accurately identify regions of interest within an image.

The second facility that a DBMS provides is the seamless storage and manipulation of large amounts of time-series image and GIS data. Image analysis algorithms can access and manipulate large repositories of image data in a batch process. A batch process can be written by a data miner to execute image mining steps on image data spanning many years. For a particular image mining application, the miner defines various image, statistical, and hybrid image/GIS algorithms in sequence; the output of one task or algorithm is provided to the next algorithm in the sequence. The results of such batch processing can be obtained either in the form of database records or as images, and are further analysed to answer the research questions.

This approach is illustrated with an application to detect cloud cover over the study area. The vector data was used to guide the principal component analysis algorithm in an image mining process. The objective is to illustrate a potential case where an image mining process is guided by GIS data, not to detect the clouds precisely or to provide accuracy assessments.

5.2.2 Data preparation

The Spinning Enhanced Visible and Infrared Imager (SEVIRI) sensor on board the MSG satellite provides data to the EUMETSAT station (Germany), where it is processed and then uplinked to HOTBIRD-6 in wavelet-compressed format [82]. MSG data for a large part of Europe was extracted from the ITC data receiver, as shown in Figure 5.1(b). The images were retrieved for daytime on December 13, 2008 and December 16, 2008 with a temporal frequency of 60 minutes. We also had prior knowledge that the study area was free of cloud cover on the former date, and that there were clouds and fog over the study area on the latter date. In this study, only band 1 (0.6 μm), band 3 (0.8 μm), and band 9 (10.8 μm) were selected, as clouds are easy to recognize in these bands. The visible band 1 can be used for visual cloud validation from the principal components of that band. The Netherlands was selected as the study area, and vector data for the Netherlands boundaries was obtained in the WGS84 LATLONG projection, as shown in Figure 5.1(a).

Figure 5.1: Clipping of research area from MSG satellite data with vector data

The next task was to construct a time-series image database with the PG DBMS and to upload the images. It was inconvenient to upload the images manually one by one into the database, especially as the image data is massive, covering a long time span. It was also required that the image exposure date and time be recorded as the event time, to build a true time-series image database. A utility program was therefore developed that builds the time-series image database with the PostgreSQL DBMS and uploads the whole image data set from disk to the database automatically. The code for this utility program, written using the TerraLib C++ language interface, is provided as Appendix A. The application is a command-line utility that takes the directory path of the image data as input.

5.2.3 Method and results

There are many computational methods for cloud detection in which physical properties such as reflectance, temperature, and radiance are used as thresholds to mark an individual pixel as cloudy. These physical characteristics are sometimes very difficult to estimate. They are estimated and provided as data products to which different cloud masks are applied to identify cloudy pixels.

Figure 5.2: An image mining process with integrated image and vector analysis with TerraLib on top of the DBMS technology

Another option is to mine the image data itself for cloud-pattern detection. An approach suggested by Elbert [83] involves pattern recognition based on large-scale texture analysis. Principal component analysis (PCA) is another statistical approach: it follows explanatory learning and identifies patterns and relations in the data itself. Spectral signatures contain physical characteristics of the land cover type, including clouds, land, etc. The PCA identifies patterns for the different land cover types in the data by reducing the number of dimensions, in terms of principal components, and expressing the data in such a way as to highlight their similarities and differences. The details of the PCA method are provided in Chapter 3. The PCA technique was also analysed thoroughly on SEVIRI multi-spectral image data for cloud-pattern detection by Amato et al. [84], and found to be very effective in detecting clouds. The objective in this research is not just to test the PCA method for cloud detection, but to apply it in providing the framework for integrated remote sensing/GIS analysis in an image mining process on top of the DBMS technology, so as to develop more advanced image mining scenarios.

An image mining batch process was written using the TerraLib C++ programming interface. It retrieves images one by one from the time-series image database, performs the sequence of steps shown in Figure 5.2 in an iterative way, and stores the results in the database as images. These images were then accessed and visually analysed with TerraView, as shown in Figure 5.3. The code


Chapter 5. Application scenarios for image mining: Results and Discussions

Figure 5.3: The resulting principal components for two dates at 14:00 hours


of the program is provided as Appendix B.

The GIS vector data for the Netherlands was used for the selection of pixels for that area from an MSG image, as shown in Figure 5.1(c,d). The selected pixels for the study area were handled in memory and were provided as input to the PCA algorithm to generate principal components for three bands. These principal components were then stored in the PG database as time-series PC images for visual interpretation using TerraView.
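The iterative batch pattern used here (retrieve an image, clip it to the study area, compute statistics, store the result) can be sketched as follows. The step functions are stand-ins for the TerraLib overlay and statistics calls, and the data is invented.

```python
# Sketch of the batch-process pattern: iterate over a time series, apply a
# fixed sequence of steps per timestamp, and collect results for storage.
# clip_to_region and band_mean are placeholders for TerraLib operations.

def clip_to_region(image, region):
    # Placeholder overlay step: keep only pixels inside the region
    # (here, None marks a pixel outside the vector boundary).
    return [px for px in image if px is not None]

def band_mean(pixels):
    return sum(pixels) / len(pixels)

def batch_process(time_series, region):
    results = {}
    for timestamp, image in sorted(time_series.items()):
        pixels = clip_to_region(image, region)
        results[timestamp] = band_mean(pixels)  # would be stored in the DB
    return results

series = {"20081213140000": [100, 104, None], "20081216140000": [150, 150]}
result = batch_process(series, region="NL")
print(result)
```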

Figure 5.3 shows the PC images generated by the principal component analysis algorithm for a clear day, December 13, 2008, and a cloudy day, December 16, 2008, both at 14:00 hours. The presence of clouds can be noticed through visual interpretation of the first principal components.

5.2.4 Discussion

A data mining approach based on integrated image and GIS data analysis on top of the database technology is illustrated here. This approach requires that overlay operations can be performed on the GIS data (vector model) and the remote sensing data (raster model) stored in the database. The results of such integrated overlay operations should also be displayable for visual analysis, and should be storable in the database for access by other analytical applications.

During the data preparation phase, the ZLIB compression mode was used for storing images in the PG database. It was observed that compression techniques can considerably reduce the amount of data in the database, as shown in Figure 5.4 based on the statistics in Table 5.1. The actual benefit of compression, together with the tiling and index support provided by TerraLib, however, is to reduce the search space and time, and to raise performance to a level acceptable for constructing large remote sensing image databases with a standard DBMS. The performance gains include improved physical disk I/O, faster scan operations, better use of network bandwidth in a client-server environment, and reduced memory requirements. Compression is such a powerful option that it is becoming the default for any high-performance DBMS. There is some performance overhead when compressing at write and decompressing at read, but this overhead is tolerable compared to the performance gained. TerraLib further reduces this overhead by providing compression/decompression at the tile level. The size of the tiles can be adjusted according to the read/write activity of the image data.
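The tile-level compression idea can be sketched as follows. The tile size matches the 256x256 tiling used in Appendix A, but the helper names and the flat test image are illustrative, not TerraLib code.

```python
import zlib

# Sketch of tile-level compression: the image is split into fixed-size
# tiles and each tile is deflated independently, so a read only has to
# decompress the tiles it actually touches.

TILE = 256 * 256  # one 8-bit band tile, matching the 256x256 tiling above

def compress_tiles(pixels, tile_size=TILE):
    tiles = [bytes(pixels[i:i + tile_size])
             for i in range(0, len(pixels), tile_size)]
    return [zlib.compress(t) for t in tiles]

def read_tile(compressed, index):
    # Only the requested tile is decompressed on read.
    return zlib.decompress(compressed[index])

# A flat, homogeneous band compresses extremely well.
image = [128] * (4 * TILE)
ctiles = compress_tiles(image)
print(sum(len(t) for t in ctiles), "compressed bytes vs", 4 * TILE, "raw")
```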

The image retrieval speed from the database cannot beat that from the disk in some cases; however, seamless overlay operations for integrated remote sensing and GIS analysis, with adequate image retrieval performance from a spatial database, have an extremely wide scope in advanced GIS application development. A GIS application based on integrated image and vector data analysis can be rapidly developed on a large repository of satellite images and GIS data stored in the database. Storing images on the disk as external files is optimal only if the purpose is just to retrieve an image for dissemination or visualization.


Table 5.1: Image size on the disk and in the database

Image:                                          1     2     3
Image size on disk                            721  2651  6645
Image size in database without compression    840  3032  8392
Image size in database with compression       696  2032  3200
% Increase in size without compression         16    14    26
% Decrease in size with compression             4    13   107

Figure 5.4: Comparison of image size on disk with size in database using compression


5.3 Extending TerraView for a temporal image query

5.3.1 Introduction

Remote sensing image databases can provide more frequent access to image data for a particular area of interest and time, which can be used in developing land cover/land use change detection applications. Such applications frequently access time-series image data at different temporal frequencies. A temporal image query provides the result set in the form of images, or as database records populated with image visual or statistical characteristics, for a time-series visual or statistical analysis respectively.

In the previous scenario, a time-series image database was created by developing an application programme, and the temporal information was stored in the database as metadata. But the TerraView interface does not provide the facility to perform a temporal image query for visualizing the time-series of images. An operator needs to construct manually, in the visualization interface, the time-series of images for a particular time interval. This is difficult to do because the TerraView visualization interface has no option to view the metadata of the image data stored in the PG database. Therefore, the objective was to provide an option to execute a temporal image query from the visualization interface, to retrieve stored PC images from the database for a specified period of time.

In the previous section, images were retrieved from a time-series image database as input for a mining algorithm. The GIS vector data was also stored along with the image data, and was retrieved to define boundaries over the land cover for our study area. The image date and time were stored as metadata attributes. In this section, we illustrate how to perform temporal queries on a time-series image database, and how the results are populated as TerraView views and themes for visualization.
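Conceptually, such a temporal image query reduces to filtering image-layer metadata by an acquisition interval. The record layout and layer names below are invented for illustration; they are not the TerraLib metadata schema.

```python
from datetime import datetime

# Sketch of a temporal image query: metadata records (hypothetical layout)
# carry the exposure timestamp, and a query selects the layers whose
# acquisition time falls inside a requested interval.

records = [
    {"layer": "msg_b1_20081213_0900", "acquired": datetime(2008, 12, 13, 9, 0)},
    {"layer": "msg_b1_20081213_1100", "acquired": datetime(2008, 12, 13, 11, 0)},
    {"layer": "msg_b1_20081213_1400", "acquired": datetime(2008, 12, 13, 14, 0)},
]

def temporal_query(records, start, end):
    return [r["layer"] for r in records if start <= r["acquired"] <= end]

hits = temporal_query(records,
                      datetime(2008, 12, 13, 9, 0),
                      datetime(2008, 12, 13, 12, 0))
print(hits)  # the 09:00 and 11:00 layers
```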

5.3.2 Method and results

To visualize an image in TerraView for a specific date and time, an operator first needs to query the database to obtain prior knowledge, before manually creating views and themes on image layers for visualization. The TerraView visualization interface does not allow querying a time-series image database before creating views and themes on image layers.

An application programme was developed using TerraLib and C++ on top of the PG DBMS, with the objective of extending TerraView with the facility to query a time-series image database for a defined time period. This TerraLib application takes a date and a time as input to define the time period. The script formulates a temporal query on the remote sensing data layers in the database, and the images retrieved for that time span are populated programmatically in TerraView as views and themes for visual analysis. The code for the application is provided in Appendix C.

The code was executed to perform a temporal query for December 13, 2008 for the time interval of 09:00:00 AM to 12:00:00 AM. The view was created with a name displaying the time interval, and themes for the retrieved images were populated displaying date, time and band information, as shown in Figure 5.5.

Figure 5.5: The views/themes populated as a result of the temporal query

5.4 Database-oriented approaches to data mining

5.4.1 Introduction

Data modeling or database-specific heuristics are used to exploit the characteristics of attribute data. In image mining, such attribute data can be generated by applying a mining algorithm to image data stored in the database and storing the results as database records. The image mining algorithm is required to access the database for time-series image data through SQL statements, to provide this data as input to a sequence of image processing steps in an iterative way, and to store the results as database records. Further analysis can then be performed on the attribute space of those database records. Database-oriented data mining algorithms that work on the attribute space, such as association rule discovery, can also be applied. Sections 3.2.2 and 3.3.2 of this thesis discuss the database-oriented approaches for classical and spatial data mining. Association rule discovery is a database-oriented data mining approach to find patterns in attribute data. The attribute data can also be used by the machine learning algorithms for data mining discussed in Section 3.2.3, among which C4.5 is extensively used. C4.5 tries to find patterns in a multivariate attribute space by formulating decision trees over attribute vectors during the classification of attribute data in a data mining process. Such classifiers associate attributes such as mean, max, entropy, etc., resulting from an image processing algorithm, with high-level application concepts for pattern or knowledge discovery. The application concepts can be any categorical attributes or GIS data that can categorize the results of image processing in an image mining application.
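A toy sketch in the spirit of a C4.5 split is shown below: among candidate thresholds on one attribute, choose the one with the highest information gain over labelled image statistics. The attribute name, values and labels are invented; real C4.5 recursively builds a full tree over many attributes.

```python
import math

# Toy decision-stump learner: pick the single attribute threshold that
# best separates labelled records by information gain (C4.5's criterion).

def entropy(labels):
    n = len(labels)
    out = 0.0
    for lab in set(labels):
        p = labels.count(lab) / n
        out -= p * math.log2(p)
    return out

def best_stump(records, attr):
    labels = [r["label"] for r in records]
    base = entropy(labels)
    best = None
    values = sorted(r[attr] for r in records)
    for cut in values[:-1]:          # candidate split points
        left = [r["label"] for r in records if r[attr] <= cut]
        right = [r["label"] for r in records if r[attr] > cut]
        gain = (base
                - (len(left) / len(records)) * entropy(left)
                - (len(right) / len(records)) * entropy(right))
        if best is None or gain > best[1]:
            best = (cut, gain)
    return best

stats = [
    {"mean": 210.0, "label": "cloudy"},
    {"mean": 190.0, "label": "cloudy"},
    {"mean": 90.0, "label": "clear"},
    {"mean": 110.0, "label": "clear"},
]
cut, gain = best_stump(stats, "mean")
print(cut, round(gain, 3))  # 110.0 1.0
```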

Database approaches normally materialize frequently asked, expensive computations and aggregates in the database, to scale data mining and OLAP (Online Analytical Processing) applications to the high computational costs such applications require. The summary data generated by aggregate functions is stored in specially designed database schemas, such as the star schema, for knowledge discovery. These database techniques also help to reduce data dimensionality and to integrate various sources of data. Such techniques can be adopted in image mining by storing the most common spatial, temporal and statistical data for images along with other image attributes. For instance, the NDVI for a time-series of images can be calculated and stored along with the image date and time, image statistics, spatial characteristics, and derived GIS data, in a database schema designed to serve one or more data mining applications.
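The materialisation idea can be sketched with an in-memory SQLite database standing in for PostgreSQL. The fact and summary tables, and the NDVI example from the text, are illustrative; they are not the thesis schema.

```python
import sqlite3

# Sketch of materialising an expensive aggregate in a summary table
# (star-schema style): per-image NDVI means are computed once and stored
# for later OLAP-style queries, instead of being recomputed per pixel.

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pixel_fact (image_id INTEGER, red REAL, nir REAL)")
con.execute("""CREATE TABLE image_summary (
                   image_id INTEGER PRIMARY KEY, mean_ndvi REAL)""")
con.executemany("INSERT INTO pixel_fact VALUES (?, ?, ?)",
                [(1, 0.2, 0.6), (1, 0.3, 0.5), (2, 0.4, 0.4)])

# Materialise NDVI = (nir - red) / (nir + red), aggregated per image, once.
con.execute("""INSERT INTO image_summary
               SELECT image_id, AVG((nir - red) / (nir + red))
                 FROM pixel_fact GROUP BY image_id""")

rows = con.execute(
    "SELECT * FROM image_summary ORDER BY image_id").fetchall()
print(rows)
```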

In classical data mining, algorithms are applied on an attribute space, and these attributes are either explicitly stored or derived. In spatial data mining, these attributes are implicit and need to be derived from spatial operations and spatial analysis. For instance, the Netherlands vector data used in our cloud detection application has thirteen polygons. A derived attribute for total area can be the sum of the geometry areas of the individual polygons. The result can be materialized in the database.
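Deriving such an implicit attribute can be sketched with the shoelace formula, summing polygon areas into a single total-area attribute that could then be materialized. The coordinates are invented; in practice the DBMS's own area functions would be used on the stored geometries.

```python
# Derive an implicit spatial attribute: total area of a set of polygons,
# each given as a ring of (x, y) vertices, via the shoelace formula.

def polygon_area(ring):
    s = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

polygons = [
    [(0, 0), (4, 0), (4, 3), (0, 3)],   # 4x3 rectangle, area 12
    [(0, 0), (2, 0), (0, 2)],           # right triangle, area 2
]
total_area = sum(polygon_area(p) for p in polygons)
print(total_area)  # 14.0
```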

5.4.2 Method and results

To illustrate the above approach, a database schema was created to store the results of an image mining algorithm as database records. In Section 5.2, time-series image data was generated from the intersection of MSG image data with vector data of the study area, after which a PCA-based statistical mining algorithm was applied to these images for cloud pattern detection, and the resulting time-series PC images were stored in the PG database. In the present scenario, the PC images output by the PCA algorithm were not stored in the database, but rather handled on the fly and provided as input to another algorithm for statistical results. These results were then stored in the PG database as attribute data. The sequence of steps shown in Figure 5.6 was carried out in an iterative way on the time-series image data stored in the PG database, and the results were stored as database records along with the image name, band id, and date and time of the image, as shown in Figure 5.7. The code for this image mining application scenario is provided in Appendix D.

The statistical results for the time-series PC images generated by the PCA-based image mining algorithm were stored in a PG database table as records. These records were accessed from MS Excel by establishing a connection to the PG database table for time-series pattern analysis. Statistical mean values generated from the PC-1 time-series images were plotted for December 13, 2008 and December 16, 2008 at a temporal frequency of one hour, as shown in Figure 5.8 and Figure 5.9 respectively.


Figure 5.6: A sequence of steps in a mining process to generate attribute data in the PG database


Figure 5.7: Attribute data for PC images as a result of the PCA algorithm applied on time-series MSG data

By comparing these results with the visual analysis from TerraView in Figure 5.3, the changing mean value at 14:00 hours can be interpreted as the presence of clouds. This value was used for the space-time annotation of an image as cloudy for the Netherlands area at a particular time in our cloud search engine application.
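The annotation rule can be sketched as a simple deviation test against a clear-day baseline. The baseline, margin and record layout are assumptions for illustration, not the values used in the application.

```python
# Sketch of the space-time annotation step: tag an image record "cloudy"
# when its PC-1 mean deviates from a clear-day baseline by more than a
# chosen margin. Baseline, margin and field names are illustrative.

def annotate(records, clear_mean, margin):
    out = []
    for rec in records:
        cloudy = abs(rec["pc1_mean"] - clear_mean) > margin
        out.append({**rec, "annotation": "cloudy" if cloudy else "clear"})
    return out

series = [
    {"time": "20081213140000", "pc1_mean": 101.0},
    {"time": "20081216140000", "pc1_mean": 142.0},
]
tagged = annotate(series, clear_mean=100.0, margin=20.0)
print([(r["time"], r["annotation"]) for r in tagged])
```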

The efficient storage and retrieval of time-series images in the PG database enabled the derivation of statistical data over time, such as the multi-temporal mean and variance, which was then included in the subsequent time-series pattern analysis. The vector data for the study area was also processed to generate attribute data, and this too was stored in a database table as records. Spatio-temporal queries joining the vector and image attribute data were then formulated, for instance:

select i.name, i.study_area_id, i.image_datetime, v.geom_area,
       i.variance, i.correlation
from   image_summary_data i, vector_summary_data v
where  i.study_area_id = v.study_area_id
and    band_id = 0
and    to_timestamp(i.image_datetime, 'YYYYMMDDHH24MISS')
       between to_timestamp('20081213100000', 'YYYYMMDDHH24MISS')
       and     to_timestamp('20081216100000', 'YYYYMMDDHH24MISS');

As shown, the spatio-temporal query provides information about the study area derived from the vector data, along with the statistical results of the PCA image mining algorithm as information on a land cover class, i.e., clouds. This integration of GIS data with the results of a time-series image analysis is extremely useful in constructing knowledge from many spatio-temporal analyses. It also presents a method for integrating spatial data warehousing and image mining, combining various sources of information at different levels.


Figure 5.8: Time-series cloud patterns analysis for December 13, 2008

Figure 5.9: Time-series cloud patterns analysis for December 16, 2008


This approach can also be effectively implemented for spatio-temporal pattern analysis, for instance of a specific phenomenon or a land cover type at various geographical levels.

It is important to note that this attribute data can be accessed through the PG SQL interface without any format conversion. A well-designed PG database schema for vector and image summaries can be a primary source for front-end OLAP (Online Analytical Processing) and visualization applications, for instance via ROLAP and CUBE operations. OLAP tools provide powerful query techniques that can be effectively implemented in a knowledge discovery process [56]. A TerraLib script to generate such vector and image data summaries was written, and the code is provided in Appendix D. The script was executed as an ETL (Extraction, Transformation, Loading) step to generate summary data for a data warehouse. The objective here was not to construct a data warehouse, but to illustrate a potential use of the attribute space of spatio-temporal image and vector data.
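An OLAP-style roll-up over such a summary table can be sketched as follows, with an in-memory SQLite database standing in for PostgreSQL. SQLite lacks a ROLLUP operator, so the grand-total row is emulated with UNION ALL; schema and values are illustrative.

```python
import sqlite3

# Sketch of an OLAP-style roll-up over an image summary table: per-day
# cloud-statistic averages plus a grand total, emulated with GROUP BY and
# UNION ALL (SQLite has no ROLLUP). Schema and values are made up.

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE image_summary (day TEXT, hour INTEGER, mean REAL)")
con.executemany("INSERT INTO image_summary VALUES (?, ?, ?)",
                [("2008-12-13", 13, 100.0), ("2008-12-13", 14, 102.0),
                 ("2008-12-16", 13, 130.0), ("2008-12-16", 14, 150.0)])

rollup = con.execute("""
    SELECT day, AVG(mean) FROM image_summary GROUP BY day
    UNION ALL
    SELECT 'ALL', AVG(mean) FROM image_summary
    ORDER BY 1
""").fetchall()
for row in rollup:
    print(row)
```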

5.5 Conclusion

The DBMS technology for time-series image databases, used to develop advanced applications from integrated image and GIS data analysis, was investigated. Image databases for storing and manipulating remote sensing data have enormous scope, and provide an efficient and flexible platform for information fusion from various RS and GIS data sources in the database. Further, the TerraLib technology built on top of the database technology provides rich overlay and integrated analysis operations, which shifts RS and GIS analysis from individual monolithic systems into a single, incorporated and simple system. This approach also provides the ability to perform spatio-temporal analysis on very large repositories of image and vector data spanning many years. A researcher can write a batch programme for a specific research problem, incorporating different data types and various spatial, temporal, statistical, and image analysis functions as a sequence of steps in a single routine. This routine can then be executed over the large amount of data in the database, and the results can be stored in the database as images, records, or spatio-temporal summaries to serve a wide range of front-end visualization and analytical applications.


Chapter 6

Conclusions and Recommendations

6.1 Conclusions

The first objective of this project was to study and analyse state-of-the-art image handling and manipulation facilities from open source libraries, in order to extend the open source PostgreSQL/PostGIS (PG/PG) DBMS with image support. PostgreSQL has been extended through PostGIS to provide spatial data types and topological operators for vector data, but it does not support remote sensing image data. This image support is required to perform integrated image and vector data analysis in space and time with DBMS technology. The requirements of this image support include:

• Efficient image storage in and retrieval from a database.

• Seamless and integrated analysis through overlay of images with vectordata in a database.

• Visualization of overlayed images and vector data obtained from a database.

A comparative study of existing open source libraries was carried out to provide such image support in PostgreSQL. These libraries were analysed on the basis of efficient image support, capability to perform hybrid image/vector analysis, provision of image processing algorithms, ability to conduct statistical analysis, temporal support, flexibility to deal with various spatial data types, extensibility, and availability of a visualization interface. The following conclusions were drawn.

• Provision of image support by a PG/PG database through extension with complex data types and overlay operations for integrated image/vector analysis is far from trivial, and can easily go beyond the scope and time of an MSc thesis. For instance, such innovation by the National Institute for Space Research, Brazil (INPE) in developing TerraLib version 3.0, released in May 2004, required an investment of 40 man-years of work. That software package comprises 95,000 lines of C++ code developed by INPE, and 195,000 lines of code from third-party libraries. INPE recently released version 3.2.1 at the end of 2008. Further, we analysed many other individual efforts towards such image support, but all of them were incomplete and not capable of supporting applications for integrated image/vector analysis. Therefore, the approach adopted was to use the TerraLib library as an intermediate level to extend the PostgreSQL/PostGIS DBMS with image support and overlay operations. It was found that INPE's open source TerraLib library provides the most efficient solution for image handling in a PG/PG database so far, and has rich functionality for integrated spatial analysis.

• Various architectural components and the conceptual model of TerraLib were analysed at the database, programming interface, and visualization levels. For this, a research platform was constructed running the PostgreSQL/PostGIS DBMS, the TerraLib library, and the TerraView visualization interface on a Linux server machine. Both the TerraView visualization interface and the TerraLib C++ programming interface were used to insert and retrieve images in the spatial database along with other datasets. TerraLib provides generic iterators for the manipulation of spatial data types in the database. These iterators were extensively studied and applied to image manipulation in the image mining application development. The TerraLib conceptual schema for image data storage, retrieval, and manipulation, together with its pyramid, tiling, index and compression support, was found to be sufficient for developing advanced database applications on large repositories of remote sensing image and GIS data.

• The requirement to provide overlay operations on image and vector data layers is fulfilled through the overlay functions provided by TerraLib. These functions were found to be extremely useful in developing image mining application scenarios through integrated image/vector analysis on top of the DBMS technology.

• Storing images on the disk as external files is optimal in some cases, for instance when the objective is only visualization, as with static web applications. But seamless overlay operations on remote sensing and GIS data layers for integrated spatial analysis, with adequate image retrieval performance from very large spatial databases, have an extremely wide scope in the development of many advanced GIS applications.

• GIS and remote sensing integrated spatial analysis requires that the analysis procedures be independent of the data structures. Most current GIS packages follow separate analysis procedures for separate data structures. INPE has adopted the approach of generic programming, through the implementation of generic design patterns in the development of the TerraLib GIS library, to make spatial analysis independent of data structures. Various design patterns adopted by the TerraLib library were identified and documented in order to develop a GIS database application featuring integrated image/vector data analysis. It was found that such a programming style allows seamless integrated spatial analysis for the various data types provided by a spatial DBMS, and allows rapid development of a GIS application or research prototype for a specific research problem.

The second objective was to develop an advanced application exploiting the image support provided in the open source spatial database PG/PG. In this context, an image mining application was developed with integrated analysis over image and vector data stored in the database. An extensive review study was carried out to discover tools and techniques for data mining, and to apply them effectively in the development of an image mining application. The statistical, database-oriented, and machine-learning approaches to data mining were studied. These approaches were then applied in developing various image mining application scenarios.

The TerraLib and PostgreSQL DBMS technology adopted for hybrid analysis was applied to develop a cloud detection application that mines cloud patterns without using any physical parameters such as sea surface temperature or moisture content in the air. Image data from the Spinning Enhanced Visible and Infrared Imager (SEVIRI) sensor on board the MSG satellite, covering a large part of Europe, was extracted from the ITC data receiver. An application programme was written to construct a time-series image database for the extracted image data with the PG/PG DBMS, using the TerraLib conceptual schema. A statistical data mining method based on principal component analysis was adopted to extract cloud features for the area of the Netherlands from the time-series image data. The following results were obtained and conclusions drawn:

• Vector data stored in the database can be used for rapid delineation of image features through the overlay of image and vector data layers. This is extremely useful in region-based studies, to extract raster statistics for regions guided by GIS data in a remote sensing image analysis process. The image pixel values for the Netherlands were extracted from the MSG time-series image data covering a larger part of Europe, using boundary vector data for the Netherlands. The TerraLib overlay functions for the intersection of time-series image and vector layers were used in a spatio-temporal analysis, in which the images in a time-series image database were analysed for cloud cover in space and time. This approach is also useful for spatio-temporal analyses in which a geometry changes over time, for instance flooding.

• One of the important advantages of the adopted spatial database approach for spatio-temporal analysis is the ability to write a batch process incorporating a large amount of time-series, multi-source data and integrated spatial analysis functions. The overlay, spatio-analytic, statistical, image analysis and aggregate functions on time-series image and vector data layers can be applied in a sequence of steps. Such batch processes for our cloud detection image mining application were written and executed over the time-series image data, and the results of the overall process were stored as images and attributes in the PG database. This is an important method for detecting transitions in land cover/land use where seasonal or long time intervals are associated with different data sources in the spatial database; for instance, analysing maize growth in various parts of Europe during a season needs an integrated analysis using time-series image data for spectral analysis, and vector data for delineating pixels for the various regions.

• The results of the PCA algorithm were stored as PC images for visual analysis, and the PC statistics were stored as database records, along with date and time, for time-series pattern analysis. The TerraView interface for visual analysis was extended to provide temporal image query support, to retrieve PC images from the database for a specific time interval. The PC statistics stored as PG database records were retrieved from the MS Excel interface for time-series pattern analysis. A decreasing mean value for band 1 at a given point on the time axis indicates the presence of cloud cover over the study area at that time. This result was also compared with the visual analysis of the PC images in the TerraView interface, and the agreement between the results from the two interfaces was quite satisfactory. This value was used for the space-time annotation of an image as cloudy for the Netherlands area at a particular time in our cloud search engine application.

• Database techniques for data mining were adopted in developing the cloud pattern detection image mining application scenario, and the results of vector aggregate and image analysis functions were stored as attribute data in a database table. This integration of vector data, or any non-spatial data, with the results of remote sensing image analysis in the attribute space is extremely useful in constructing knowledge with a spatial data mining process. It also presents a method for the integration of spatial data warehousing and image mining, combining various sources of information at different levels, which is currently an active area of research as identified in Section 3.3.2. Spatial data warehousing provides spatial data aggregation and spatial navigation, whereas image mining provides discovery of knowledge through image processing techniques. This approach can be effectively used for spatio-temporal pattern analysis, for instance of a specific phenomenon or a land cover type at various geographical levels.

6.2 Recommendations

Some recommendations for further research are as follows:

• The quality of integrated remote sensing and GIS data analysis, in terms of accuracy assessment of the results, needs to be addressed. Both raster and vector data depend on scale factors, and previous studies documented in Section 3.3.1 show that scale variation in spatial data can cause variability in the results of an integrated spatial analysis, for instance during the overlay of many image and vector data layers. The data preparation phase for an integrated analysis needs to carefully consider the spatial data scale and the acquisition methods, for instance those determined by the various sensors in the case of remotely sensed data. Any subsequent result should be documented with accuracy indications and verifications.

• The TerraLib library was used, with its intermediate TerraLib driver for PG, to extend the PostgreSQL/PostGIS database with image support. This provided the ability to perform integrated image/vector analysis for image mining over large amounts of image/vector data in the PG database.

TerraLib also provides drivers for many other databases. This approach of keeping the TerraLib library outside the DBMS gives flexibility to work with various database systems and programming interfaces. An application developer has more control over data inputs and outputs, incorporating various data formats and analysis procedures. A future innovation, bringing the TerraLib library into the DBMS without using the intermediate driver, would be useful to obtain the declarativeness and user-friendly environment provided, for example, by Oracle GeoRaster. The application developer could then implement algorithms as PG stored procedures, using TerraLib functions at the PG SQL interface, and these stored procedures could be made callable from the PG SQL interface. This would require revising the voluminous TerraLib library structure and integrating it at a lower level with PostGIS.


Appendix A

Source code for creating a time-series image database in PG

#include <TeDatabase.h>
#include <TePostgreSQL.h>
#include <TePostGIS.h>
#include <TePGInterface.h>
#include <TePGUtils.h>
#include <string.h>
#include <string>
#include <stdio.h>
#include <TeInitRasterDecoders.h>
#include <TeImportRaster.h>
#include <map>
#include <sys/types.h>
#include <dirent.h>
#include <errno.h>
#include <vector>
#include <iostream>

int main()
{
    TeDatabase* db;

    /* ==================================================================
       File system reading for images
       ================================================================== */

    string dir;
    DIR *dp;
    struct dirent *dirp;
    vector<string> files = vector<string>();

    cout << "Enter your image directory: ";
    cin >> dir;

    if ((dp = opendir(dir.c_str())) == NULL)
    {
        cout << "Error: opening the directory";
        cout << "Press Enter\n";
        getchar();
        return errno;
    }

    while ((dirp = readdir(dp)) != NULL)
    {
        string file(dirp->d_name);
        if (file == "." || file == "..")
            continue;
        files.push_back(string(file));
    }

    /* ==================================================================
       Connecting PostgreSQL/PostGIS database
       ================================================================== */

    db = new TePostGIS();
    if (!db->connect("172.16.33.170", "imran", "password", "tvdatabase", 5432))
    {
        cout << "Error: " << db->errorMessage() << endl << endl;
        cout << "Press Enter\n";
        getchar();
        return 1;
    }
    TeInitRasterDecoders();

    /* ==================================================================
       Accessing and initializing input image
       ================================================================== */

    for (unsigned int i = 2; i < files.size(); i++)
    {
        string in = files[i];
        string input = dir + "/" + in;
        TeRaster img(input);

        if (!img.init())
        {
            cout << "Cannot access the input image!" << endl << endl;
            cout << "Press Enter\n";
            getchar();
            return 1;
        }

        /* ==============================================================
           Creating image layer and setting projection for layer
           ============================================================== */

        string layerName = in;

        if (db->layerExist(layerName))
        {
            db->close();
            cout << "The database already has an infolayer with this name \"";
            cout << layerName << "\"!" << endl << endl;
            cout << "Press Enter\n";
            getchar();
            return 1;
        }

        TeDatum mDatum = TeDatumFactory::make("WGS84");
        TeProjection* pUTM = new TeLatLong(mDatum);

        TeLayer* layer = new TeLayer(layerName, db, pUTM);
        if (layer->id() <= 0)
        {
            db->close();
            cout << "The destination layer could not be created!\n"
                 << db->errorMessage() << endl;
            cout << "Press Enter\n";
            getchar();
            return 1;
        }

// ===== Importing the image into the database layer =====

    if (!TeImportRaster(layer, &img, 256, 256, TeRasterParams::TeZLib, "", 255, true, TeRasterParams::TeExpansible)) {
        db->close();
        cout << "Fail to import the image!\n\n";
        cout << "Press Enter\n";
        getchar();
        return 1;
    } else
        cout << "The image was imported successfully!\n\n";

    delete layer;
    layer = 0;
}

// ===== Closing the database connection =====

db->close();
cout << "\nPress enter...";
getchar();
return 0;
}
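The per-file loop above starts at index 2, presumably to skip the "." and ".." entries that a raw directory listing begins with. The same filtering can be expressed more explicitly; the sketch below is a standalone illustration in plain C++ (the helper `imageEntries` is ours, not part of TerraLib):

```cpp
#include <string>
#include <vector>

// Hypothetical helper: keep only real file entries from a directory listing,
// making the "skip the first two entries" convention explicit.
std::vector<std::string> imageEntries(const std::vector<std::string>& files) {
    std::vector<std::string> out;
    for (const std::string& f : files)
        if (f != "." && f != "..")
            out.push_back(f);
    return out;
}
```

With this helper, the import loop could iterate over `imageEntries(files)` directly instead of relying on the listing order.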


Appendix B

Source code for image mining application scenario: Section 5.2

#include <TeDatabase.h>
#include <TePostgreSQL.h>
#include <TePostGIS.h>
#include <TePGInterface.h>
#include <TePGUtils.h>
#include <string.h>
#include <string>
#include <stdio.h>
#include <iostream>
#include <TeRaster.h>
#include <TeInitRasterDecoders.h>
#include <TeImportRaster.h>
#include <TePDIPrincipalComponents.hpp>
#include <TePDIParameters.hpp>
#include <TeAgnostic.h>
#include <TePDIUtils.hpp>
#include <TeProgress.h>
#include <TeStdIOProgress.h>
#include <TePDIExamplesBase.hpp>
#include <TePDIAlgorithmFactory.hpp>
#include <TePDIMallatWavelets.hpp>
#include <TePrecision.h>
#include <TeSpatialOperations.h>

using namespace std;

TeDatabase* db;

int main() {
    TEAGN_LOGMSG("Process started.");

    try {
        TeStdIOProgress pi;
        TeProgress::setProgressInterf(dynamic_cast<TeProgressBase*>(&pi));
        TeInitRasterDecoders();

// ===== Connecting to the PostgreSQL/PostGIS database =====

        db = new TePostGIS();
        if (!db->connect("172.16.33.170", "imran", "password", "test", 5432)) {
            cout << "Error: " << db->errorMessage() << endl << endl;
            cout << "Press Enter\n";
            getchar();
            return 1;
        }

// ===== Getting a portal to query images from the database =====

        TeDatabasePortal* portal = db->getPortal();
        if (!portal)
            return -1;

        string sql = "SELECT layer_id, name FROM te_layer WHERE ";
        sql += " layer_id IN (SELECT layer_id FROM te_representation";
        sql += " WHERE geom_type=512)";

        // The query must be executed before rows can be fetched.
        if (!portal->query(sql)) {
            cout << "Could not execute..." << db->errorMessage();
            delete portal;
            return -1;
        }

        while (portal->fetchRow()) {
            unsigned int i = 0;
            string layer_name = portal->getData("name");

// ===== Getting GIS vector data to select features from the image =====

            TeLayer* layer = new TeLayer("arcmap", db);
            TeProjection* geomProj = layer->projection();

            TeRepresentation* rep = layer->getRepresentation(TePOLYGONS);
            if (!rep) {
                cout << "Layer has no polygons representation!" << endl;
                db->close();
                cout << "Press Enter\n";
                getchar();
                return 1;
            }

            string geomTableName = rep->tableName_;

            TePolygonSet ps;
            db->loadPolygonSet(geomTableName, "0", ps);

// ===== Clipping the image data by the vector data and handling it in memory =====

            TeLayer* imgLayer = new TeLayer(layer_name, db);
            if (imgLayer->id() < 1) {
                cout << "Cannot access image layer" << endl;
                cout << endl << "Press Enter\n";
                getchar();
                return 1;
            }

            string rasterTable = imgLayer->tableName(TeRASTER);
            int layerId = atoi(portal->getData("layer_id"));
            TeRaster* img = db->loadLayerRaster(layerId);

            TeRaster* clip = TeRasterClipping(img, ps, geomProj, "clip", 0.0, "DB");

            if (!clip->init()) {
                cout << "clip was not initialized" << endl;
                cout << "Press Enter\n";
                getchar();
                return 1;
            }

// ===== Applying the principal components algorithm developed by INPE =====

            TePDIPrincipalComponents::TePDIPCAType analysis_type =
                TePDIPrincipalComponents::TePDIPCADirect;

            TePDIParameters params_direct;

            TePDITypes::TePDIRasterPtrType inRaster1(clip, 'r');
            TEAGN_TRUE_OR_THROW(inRaster1->init(), "Unable to init inRaster1");
            TePDITypes::TePDIRasterPtrType inRaster2(clip, 'r');
            TEAGN_TRUE_OR_THROW(inRaster2->init(), "Unable to init inRaster2");
            TePDITypes::TePDIRasterPtrType inRaster3(clip, 'r');
            TEAGN_TRUE_OR_THROW(inRaster3->init(), "Unable to init inRaster3");

            TePDITypes::TePDIRasterVectorType input_rasters;
            input_rasters.push_back(inRaster1);
            input_rasters.push_back(inRaster2);
            input_rasters.push_back(inRaster3);

            std::vector<int> bands_direct;
            bands_direct.push_back(0);
            bands_direct.push_back(1);
            bands_direct.push_back(2);

            TePDITypes::TePDIRasterPtrType outRaster1_direct;
            TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster1_direct, 1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 1 Alloc error");
            TePDITypes::TePDIRasterPtrType outRaster2_direct;
            TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster2_direct, 1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 2 Alloc error");
            TePDITypes::TePDIRasterPtrType outRaster3_direct;
            TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster3_direct, 1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 3 Alloc error");

            TePDITypes::TePDIRasterVectorType output_rasters_direct;
            output_rasters_direct.push_back(outRaster1_direct);
            output_rasters_direct.push_back(outRaster2_direct);
            output_rasters_direct.push_back(outRaster3_direct);

            TeSharedPtr<TeMatrix> covariance_matrix(new TeMatrix);

            params_direct.SetParameter("analysis_type", analysis_type);
            params_direct.SetParameter("input_rasters", input_rasters);
            params_direct.SetParameter("bands", bands_direct);
            params_direct.SetParameter("output_rasters", output_rasters_direct);
            params_direct.SetParameter("covariance_matrix", covariance_matrix);

            TePDIPrincipalComponents pc_direct;
            TEAGN_TRUE_OR_THROW(pc_direct.Reset(params_direct), "Invalid Parameters");
            TEAGN_TRUE_OR_THROW(pc_direct.Apply(), "Apply error");

// ===== Writing the output principal components back into the PostgreSQL database =====

            TeImportRaster(layer_name, output_rasters_direct[0].nakedPointer(), db);
            TeImportRaster(layer_name, output_rasters_direct[1].nakedPointer(), db);
            TeImportRaster(layer_name, output_rasters_direct[2].nakedPointer(), db);
            ++i;
        }
    } catch (const TeException& e) {
        TEAGN_LOGERR("Test Failed - " + e.message());
        return EXIT_FAILURE;
    }

    db->close();
    TEAGN_LOGMSG("Test OK.");
    cout << "\nPress enter...";
    getchar();
    return EXIT_SUCCESS;
}
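The PCA step above hands TerraLib a `covariance_matrix` parameter that `TePDIPrincipalComponents` fills in while computing the transform. As a standalone illustration of the statistic involved, the sketch below computes the sample covariance of two equal-length "bands" in plain C++; it is not TerraLib code, and the function name is ours:

```cpp
#include <cstddef>
#include <vector>

// Sample covariance of two equal-length pixel vectors (two "bands").
// A PCA implementation derives its rotation from a matrix of such values.
double sampleCovariance(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size();
    double meanA = 0.0, meanB = 0.0;
    for (std::size_t i = 0; i < n; ++i) { meanA += a[i]; meanB += b[i]; }
    meanA /= n;
    meanB /= n;
    double cov = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        cov += (a[i] - meanA) * (b[i] - meanB);
    return cov / (n - 1);  // unbiased (n - 1) normalisation
}
```

Whether TerraLib uses the `n` or `n - 1` normalisation internally is not stated in the listing; the sketch only shows the shape of the computation.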


Appendix C

Source code for image mining application scenario: Section 5.3

#include <TePostGIS.h>
#include <TePostgreSQL.h>
#include <TePGInterface.h>
#include <TePGUtils.h>
#include <string.h>
#include <string>
#include <stdio.h>
#include <iostream>
#include <libpq-fe.h>

using namespace std;

int main() {
    string host;
    string dbname;
    string user = "imran";
    string password;
    int port;
    string init_date;
    string fin_date;

    cout << "Enter initial date in YYYYMMDDHH24MISS format: ";
    cin >> init_date;

    cout << "Enter final date in YYYYMMDDHH24MISS format: ";
    cin >> fin_date;

    TeDatabase* db = new TePostGIS();
    if (!db->connect("172.16.33.170", "imran", "password", "test", 5432)) {
        cout << "Error: " << db->errorMessage() << endl << endl;
        cout << "Press Enter\n";
        getchar();
        return 1;
    }

    TeDatabasePortal* portal = db->getPortal();
    if (!portal)
        return -1;

// ===== Querying the image database for images within the required date-time window =====

    string sql = "SELECT l.name AS name, to_char(r.initial_time, ";
    sql += " 'YYYYMMDDHH24MISS') AS inidatetime, ";
    sql += " to_char(r.final_time, 'YYYYMMDDHH24MISS') AS findatetime";
    sql += " FROM te_layer l, (SELECT layer_id, initial_time, final_time";
    sql += " FROM te_representation WHERE geom_type=512 AND initial_time";
    sql += " BETWEEN to_timestamp('" + init_date + "','YYYYMMDDHH24MISS') ";
    sql += " AND to_timestamp('" + fin_date + "','YYYYMMDDHH24MISS')) r ";
    sql += " WHERE l.layer_id=r.layer_id ORDER BY r.initial_time DESC";

// ===== Extending TerraView with the retrieved images through views and themes =====

    string viewName = init_date + "To" + fin_date;

    TeView* view = new TeView(viewName, user);
    TeProjection* proj = new TeNoProjection();
    view->projection(proj);

    if (!db->insertView(view)) {
        cout << "Fail to insert the view into the database: " << db->errorMessage() << endl;
        db->close();
        delete db;
        cout << endl << "Press Enter\n";
        getchar();
        return 1;
    }


    if (!portal->query(sql)) {
        cout << "Could not execute..." << db->errorMessage();
        delete portal;
        return -1;
    }

    while (portal->fetchRow()) {
        unsigned int it = 0;
        string layer_name = portal->getData("name");
        string inidatetime = portal->getData("inidatetime");
        string findatetime = portal->getData("findatetime");

        TeLayer* imgLayer = new TeLayer(layer_name, db);
        view->projection(imgLayer->projection());

        if (imgLayer->id() < 1) {
            cout << "Cannot access image layer" << endl;
            cout << endl << "Press Enter\n";
            getchar();
            return 1;
        }

        TeTheme* theme = new TeTheme(layer_name, imgLayer);
        theme->visibleRep(TeRASTER | 0x40000000);
        view->add(theme);

        if (!theme->save()) {
            cout << "Fail to save the theme in the database: " << db->errorMessage() << endl;
            db->close();
            delete db;
            cout << endl << "Press Enter\n";
            getchar();
            return 1;
        }

        TeGrouping* group1 = new TeGrouping();

        ++it;
    }

    db->close();
    cout << "\nPress enter...";
    getchar();
    return 0;
}
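The listing above assembles its time-window predicate by concatenating the user-supplied dates into the SQL text. A minimal standalone sketch of that predicate builder follows (plain C++; the helper name is ours, while the `initial_time` column and the `YYYYMMDDHH24MISS` mask come from the TerraLib schema used in the listing):

```cpp
#include <string>

// Builds the BETWEEN predicate used to filter te_representation rows by
// acquisition time, exactly as the listing concatenates it.
std::string timeWindowPredicate(const std::string& init_date, const std::string& fin_date) {
    std::string sql = "initial_time BETWEEN ";
    sql += "to_timestamp('" + init_date + "','YYYYMMDDHH24MISS')";
    sql += " AND to_timestamp('" + fin_date + "','YYYYMMDDHH24MISS')";
    return sql;
}
```

Note that splicing user input into SQL this way is vulnerable to injection; a parameterized query (for example via libpq's `PQexecParams`) would be the safer pattern in production code.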


Appendix D

Source code for image mining application scenario: Section 5.4

#include <TeTable.h>
#include <TeDatabase.h>
#include <TePostgreSQL.h>
#include <TePostGIS.h>
#include <TePGInterface.h>
#include <TePGUtils.h>
#include <string.h>
#include <string>
#include <stdio.h>
#include <iostream>
#include <TePDIExamplesBase.hpp>
#include <TePDIStatistic.hpp>
#include <TePDIPrincipalComponents.hpp>
#include <TePDIParameters.hpp>
#include <TeAgnostic.h>
#include <TePDIUtils.hpp>
#include <TeInitRasterDecoders.h>
#include <TeImportRaster.h>
#include <TeRaster.h>
#include <TeProgress.h>
#include <TeStdIOProgress.h>
#include <TePrecision.h>
#include <TeGeometry.h>
#include <TeBox.h>

using namespace std;

TeDatabase* db;
int i, j;


// ===== Method to generate vector summaries =====

int generate_vector_summary() {

    TeAttributeList attListVec;
    string tableNameVec = "vector_summary_data";

    db->deleteTable(tableNameVec);

    TeAttribute atid;
    atid.rep_.name_ = "study_area_id";
    atid.rep_.type_ = TeREAL;
    atid.rep_.numChar_ = 3;
    TeAttribute atname;
    atname.rep_.name_ = "study_area_name";
    atname.rep_.type_ = TeSTRING;
    atname.rep_.numChar_ = 30;
    TeAttribute atgeom;
    atgeom.rep_.name_ = "geom_id";
    atgeom.rep_.type_ = TeREAL;
    atgeom.rep_.numChar_ = 30;
    atgeom.rep_.decimals_ = 4;
    TeAttribute atarea;
    atarea.rep_.name_ = "geom_area";
    atarea.rep_.type_ = TeREAL;
    atarea.rep_.numChar_ = 30;
    atarea.rep_.decimals_ = 4;

    attListVec.push_back(atid);
    attListVec.push_back(atname);
    attListVec.push_back(atgeom);
    attListVec.push_back(atarea);

    if (!db->createTable(tableNameVec, attListVec)) {
        cout << "Fail to create the table: " << db->errorMessage() << endl;
        db->close();
        cout << endl << "Press Enter\n";
        getchar();
        return 1;
    }

    TeLayer* layer = new TeLayer("arcmap", db);

    TeRepresentation* rep = layer->getRepresentation(TePOLYGONS);
    string geomTableName = rep->tableName_;

    string objId = "0";
    string q;
    TeDatabasePortal* portalgeom = db->getPortal();
    q = "SELECT * FROM " + geomTableName;
    q += " WHERE object_id = '" + objId + "'";

    if (!portalgeom->query(q)) {
        cout << "geometry of the mask was not fetched" << endl;
        delete portalgeom;
        cout << "Press Enter\n";
        getchar();
        return 1;
    }

    TePolygon poly;
    portalgeom->fetchGeometry(poly);
    string sumGeom = Te2String(TeGeometryArea(poly));
    delete portalgeom;

    string attrTableName = "arcmap";
    string qr;
    TeDatabasePortal* portalattr = db->getPortal();
    qr = "SELECT nation, cntryname, object_id_159 FROM " + attrTableName;

    if (!portalattr->query(qr)) {
        cout << "could not get attributes of study area" << endl;
        delete portalattr;
        cout << "Press Enter\n";
        getchar();
        return 1;
    }

    while (portalattr->fetchRow()) {
        unsigned int itt = 0;

        string id = portalattr->getData("nation");
        string name = portalattr->getData("cntryname");
        string geom_id = portalattr->getData("object_id_159");

        TeTable table(tableNameVec, attListVec, "");

        TeTableRow row;
        row.push_back(id);
        row.push_back(name);
        row.push_back(geom_id);
        row.push_back(sumGeom);
        row.push_back("object1");

        table.add(row);
        if (!db->insertTable(table)) {
            cout << "Fail to save the table: " << db->errorMessage() << endl;
            db->close();
            cout << endl << "Press Enter\n";
            getchar();
            return 1;
        }

        ++itt;
    }
    delete portalattr;
    return 0;
}

// ===== Method to generate image summaries =====

int generate_raster_summary() {

    TeAttributeList attList;
    string tableName = "image_summary_data";

    db->deleteTable(tableName);

    TeAttribute at;
    at.rep_.name_ = "name";
    at.rep_.type_ = TeSTRING;
    at.rep_.numChar_ = 250;
    TeAttribute atrr;
    atrr.rep_.name_ = "study_area_id";
    atrr.rep_.type_ = TeREAL;
    atrr.rep_.numChar_ = 3;
    TeAttribute atrb;
    atrb.rep_.name_ = "band_id";
    atrb.rep_.type_ = TeREAL;
    atrb.rep_.numChar_ = 2;
    TeAttribute atr;
    atr.rep_.name_ = "image_datetime";
    atr.rep_.type_ = TeSTRING;
    atr.rep_.numChar_ = 30;
    TeAttribute atr1;
    atr1.rep_.name_ = "sum";
    atr1.rep_.type_ = TeREAL;
    atr1.rep_.numChar_ = 15;
    atr1.rep_.decimals_ = 4;
    TeAttribute atr2;
    atr2.rep_.name_ = "mean";
    atr2.rep_.type_ = TeREAL;
    atr2.rep_.numChar_ = 15;
    atr2.rep_.decimals_ = 4;
    TeAttribute atr3;
    atr3.rep_.name_ = "Entropy";
    atr3.rep_.type_ = TeREAL;
    atr3.rep_.numChar_ = 15;
    atr3.rep_.decimals_ = 4;
    TeAttribute atr4;
    atr4.rep_.name_ = "Correlation";
    atr4.rep_.type_ = TeREAL;
    atr4.rep_.numChar_ = 30;
    atr4.rep_.decimals_ = 4;
    TeAttribute atr5;
    atr5.rep_.name_ = "Min";
    atr5.rep_.type_ = TeREAL;
    atr5.rep_.numChar_ = 15;
    atr5.rep_.decimals_ = 4;
    TeAttribute atr6;
    atr6.rep_.name_ = "Max";
    atr6.rep_.type_ = TeREAL;
    atr6.rep_.numChar_ = 15;
    atr6.rep_.decimals_ = 4;
    TeAttribute atr7;
    atr7.rep_.name_ = "variance";
    atr7.rep_.type_ = TeREAL;
    atr7.rep_.numChar_ = 15;
    atr7.rep_.decimals_ = 4;
    TeAttribute atr8;
    atr8.rep_.name_ = "StdDev";
    atr8.rep_.type_ = TeREAL;
    atr8.rep_.numChar_ = 15;
    atr8.rep_.decimals_ = 4;
    TeAttribute atr9;
    atr9.rep_.name_ = "Mode";
    atr9.rep_.type_ = TeREAL;
    atr9.rep_.numChar_ = 15;
    atr9.rep_.decimals_ = 4;

    attList.push_back(at);
    attList.push_back(atrr);
    attList.push_back(atrb);
    attList.push_back(atr);
    attList.push_back(atr1);
    attList.push_back(atr2);
    attList.push_back(atr3);
    attList.push_back(atr4);
    attList.push_back(atr5);
    attList.push_back(atr6);
    attList.push_back(atr7);
    attList.push_back(atr8);
    attList.push_back(atr9);

    if (!db->createTable(tableName, attList)) {
        cout << "Fail to create the table: " << db->errorMessage() << endl;
        db->close();
        cout << endl << "Press Enter\n";
        getchar();
        return 1;
    }

    TeDatabasePortal* portal = db->getPortal();
    if (!portal)
        return -1;

    string sql = "SELECT l.layer_id AS layer_id, l.name AS name, ";
    sql += " to_char(r.initial_time, 'YYYYMMDDHH24MISS') AS ";
    sql += " inidatetime, to_char(r.final_time, 'YYYYMMDDHH24MISS')";
    sql += " AS findatetime FROM te_layer l, (SELECT layer_id, ";
    sql += " initial_time, final_time FROM te_representation ";
    sql += " WHERE geom_type=512) AS r WHERE l.layer_id=r.layer_id ";
    sql += " AND r.initial_time IS NOT NULL ORDER BY r.initial_time DESC";

    if (!portal->query(sql)) {
        cout << "Could not execute..." << db->errorMessage();
        delete portal;
        return -1;
    }

    while (portal->fetchRow()) {
        unsigned int it = 0;

        string layer_name = portal->getData("name");
        string inidatetime = portal->getData("inidatetime");

        TeLayer* layer = new TeLayer("arcmap", db);
        TeProjection* geomProj = layer->projection();

        TeRepresentation* rep = layer->getRepresentation(TePOLYGONS);
        if (!rep) {
            cout << "Layer has no polygons representation!";
            db->close();
            cout << "Press Enter\n";
            getchar();
            return 1;
        }

        string geomTableName = rep->tableName_;

        TePolygonSet ps;
        db->loadPolygonSet(geomTableName, "0", ps);

        TeLayer* imgLayer = new TeLayer(layer_name, db);
        if (imgLayer->id() < 1) {
            cout << "Cannot access image layer" << endl;
            cout << endl << "Press Enter\n";
            getchar();
            return 1;
        }

        string rasterTable = imgLayer->tableName(TeRASTER);
        int layerId = atoi(portal->getData("layer_id"));
        TeRaster* img = db->loadLayerRaster(layerId);

        TeRaster* clip = TeRasterClipping(img, ps, geomProj, "clip", 0.0, "DB");

        if (!clip->init()) {
            cout << "clip was not initialized";
            cout << "Press Enter\n";
            getchar();
            return 1;
        }

// ===== Applying the principal components algorithm developed by INPE =====

        TePDIPrincipalComponents::TePDIPCAType analysis_type =
            TePDIPrincipalComponents::TePDIPCADirect;

        TePDIParameters params_direct;

        TePDITypes::TePDIRasterPtrType inRaster1(clip, 'r');
        TEAGN_TRUE_OR_THROW(inRaster1->init(), "Unable to init inRaster1");
        TePDITypes::TePDIRasterPtrType inRaster2(clip, 'r');
        TEAGN_TRUE_OR_THROW(inRaster2->init(), "Unable to init inRaster2");
        TePDITypes::TePDIRasterPtrType inRaster3(clip, 'r');
        TEAGN_TRUE_OR_THROW(inRaster3->init(), "Unable to init inRaster3");

        TePDITypes::TePDIRasterVectorType input_rasters;
        input_rasters.push_back(inRaster1);
        input_rasters.push_back(inRaster2);
        input_rasters.push_back(inRaster3);

        std::vector<int> bands_direct;
        bands_direct.push_back(0);
        bands_direct.push_back(1);
        bands_direct.push_back(2);

        TePDITypes::TePDIRasterPtrType outRaster1_direct;
        TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster1_direct, 1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 1 Alloc error");
        TePDITypes::TePDIRasterPtrType outRaster2_direct;
        TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster2_direct, 1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 2 Alloc error");
        TePDITypes::TePDIRasterPtrType outRaster3_direct;
        TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster3_direct, 1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 3 Alloc error");

        TePDITypes::TePDIRasterVectorType output_rasters_direct;
        output_rasters_direct.push_back(outRaster1_direct);
        output_rasters_direct.push_back(outRaster2_direct);
        output_rasters_direct.push_back(outRaster3_direct);

        TeSharedPtr<TeMatrix> covariance_matrix(new TeMatrix);

        params_direct.SetParameter("analysis_type", analysis_type);
        params_direct.SetParameter("input_rasters", input_rasters);
        params_direct.SetParameter("bands", bands_direct);
        params_direct.SetParameter("output_rasters", output_rasters_direct);
        params_direct.SetParameter("covariance_matrix", covariance_matrix);

        TePDIPrincipalComponents pc_direct;
        TEAGN_TRUE_OR_THROW(pc_direct.Reset(params_direct), "Invalid Parameters");
        TEAGN_TRUE_OR_THROW(pc_direct.Apply(), "Apply error");

// ===== Output of the PCA algorithm is provided for statistical calculation =====

        for (int iter = 0; iter < output_rasters_direct.size(); iter++) {
            TePDITypes::TePDIRasterPtrType inRaster = output_rasters_direct[iter];
            TEAGN_TRUE_OR_THROW(inRaster->init(), "Unable to init inRaster");

            TePDIParameters pars;
            TePDITypes::TePDIRasterVectorType rasters;
            rasters.push_back(inRaster);
            pars.SetParameter("rasters", rasters);
            std::vector<int> bands;
            bands.push_back(0);
            pars.SetParameter("bands", bands);

            TeBox box = inRaster->params().boundingBox();
            TePolygon pol = polygonFromBox(box);
            TePDITypes::TePDIPolygonSetPtrType polset(new TePolygonSet);
            polset->add(pol);
            pars.SetParameter("polygonset", polset);

            TePDIStatistic stat;
            TEAGN_TRUE_OR_THROW(stat.Reset(pars), "Reset error");

            string band = Te2String(iter);
            string sum = Te2String(stat.getSum(0));
            string mean = Te2String(stat.getMean(0));
            string variance = Te2String(stat.getVariance(0));


            string StdDev = Te2String(stat.getStdDev(0));
            string getEntropy = Te2String(stat.getEntropy(0));
            string getMin = Te2String(stat.getMin(0));
            string getMax = Te2String(stat.getMax(0));
            string getMode = Te2String(stat.getMode(0));
            string getCorrelation = Te2String(stat.getCorrelation(0, 0));

            TeTable table(tableName, attList, "");

            TeTableRow row;
            row.push_back(layer_name);
            row.push_back("31");
            row.push_back(band);
            row.push_back(inidatetime);
            row.push_back(sum);
            row.push_back(mean);
            row.push_back(getEntropy);
            row.push_back(getCorrelation);
            row.push_back(getMin);
            row.push_back(getMax);
            row.push_back(variance);
            row.push_back(StdDev);
            row.push_back(getMode);
            row.push_back("object1");

            table.add(row);
            if (!db->insertTable(table)) {
                cout << "Fail to save the table: " << db->errorMessage() << endl;
                db->close();
                cout << endl << "Press Enter\n";
                getchar();
                return 1;
            }
        }
        ++it;
    }
    return 0;
}

int main() {
    TEAGN_LOGMSG("Test started.");
    try {
        TeStdIOProgress pi;
        TeProgress::setProgressInterf(dynamic_cast<TeProgressBase*>(&pi));
        TeInitRasterDecoders();


        db = new TePostGIS();
        if (!db->connect("172.16.33.170", "imran", "password", "test", 5432)) {
            cout << "Error: " << db->errorMessage() << endl << endl;
            cout << "Press Enter\n";
            getchar();
            return 1;
        }

        generate_vector_summary();
        generate_raster_summary();

    } catch (const TeException& e) {
        TEAGN_LOGERR("Test Failed - " + e.message());
        return EXIT_FAILURE;
    }

    db->close();
    TEAGN_LOGMSG("Test OK.");
    cout << "\nPress enter...";
    getchar();
    return EXIT_SUCCESS;
}
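The image-summary table filled above stores sum, mean, variance, standard deviation, minimum and maximum per band, obtained from TerraLib's `TePDIStatistic`. As a standalone sketch of those descriptive statistics, the helper below computes the same quantities for a vector of pixel values in plain C++ (the struct and function are ours; whether TerraLib normalises variance by `n` or `n - 1` is not stated in the listing, and this sketch uses the population form):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Descriptive statistics of a band's pixel values, mirroring the fields the
// image_summary_data table stores (sum, mean, variance, StdDev, Min, Max).
struct BandStats {
    double sum, mean, variance, stddev, min, max;
};

BandStats describe(const std::vector<double>& px) {
    BandStats s{};
    for (double v : px) s.sum += v;
    s.mean = s.sum / px.size();
    for (double v : px) s.variance += (v - s.mean) * (v - s.mean);
    s.variance /= px.size();               // population variance
    s.stddev = std::sqrt(s.variance);
    s.min = *std::min_element(px.begin(), px.end());
    s.max = *std::max_element(px.begin(), px.end());
    return s;
}
```

Each field maps one-to-one onto a column of the summary table, which is what makes the per-image, per-band rows comparable across the time series.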


Bibliography

[1] M. Egenhofer, Spatial Information Appliances: A next generation of geo-graphic information systems, First Brazilian Workshop on Geoinformatics,1999, Campinas, Brazil.

[2] Selim Aksoy, Krzysztof Koperski, Carsten Tusk, and Giovanni Marchisio,Interactive Training of Advanced Classifiers for Mining Remote SensingImage Archives, In Proceedings of the tenth ACM SIGKDD internationalconference on Knowledge discovery and data mining, 2004, 773–782, Seat-tle, WA, USA.

[3] Lubia Vinhas, Richardo Cartexo Modesto De Souza, and Gilberto Camara,Image data handling in spatial databases, V Brazilian Symposium onGeoinformatics, GeoInfo2003, 2003, Campos do Jordao, SP, Brazil.

[4] Vittorio Castelli, and Lawrence D. Bergman, Image Databases: Searchand Retrieval of Digital Imagery, John Wily & Sons Inc., 2002, New York,USA.

[5] N.M. Mattikalli, Integration of remotely-sensed raster data with vector-based geographical information system for land-use change detection, Inthe International Journal of Remote Sensing, 1995, 16, 15, 2813–2828.

[6] C.C. Petit, and E.F. Lambin, Integration of multi-source remote sensingdata for land cover change detection, In the International Journal of Geo-graphical Information Science, 2001, 15, 785–803.

[7] Alexander S. Perepechko, Jessica K. Graybill, Craig ZumBrunnen, andDmitry Sharkov, Spatial database development for Russian urban areas:A new conceptual framework, In the Journal of GIScience & Remote Sens-ing, 2005, 42, 2, 144–170.

[8] X. Yang, and C.P. Lo, Using a time series of satellite imagery to detect landuse and land cover changes in the Atlanta, Georgia metropolitan area, Inthe International Journal of Remote Sensing, 2002, 23, 1775–1798.

[9] S. Gautama, J.D. Haeyer, and W. Philips, Graph-based change detection ingeographic information using VHR satellite images, In the InternationalJournal of Remote Sensing, 2006, 27, 9, 1809–8124.

103

Bibliography

[10] Qihao Weng, Land use change analysis in the Zhujiang Delta of China us-ing satellite remote sensing, GIS and stochastic modelling, In the Journalof Environmental Management, 2002, 64, 273–284.

[11] D. Lu, P. Mausel, E. Brondizio, and E. Moran, Change detection tech-niques, In the International Journal of Remote Sensing, 2004, 25, 12,2365–2407.

[12] Jignesh Patel, JieBing Yu, Navin Kabra, Kristin Tufte, Biswadeep Nag,Josef Burger, Nancy Hall, Karthikeyan Ramasamy, Roger Lueder, CurtEllmann, Jim Kupsch, Shelly Guo, Johan Larson, David DeWitt, and Jef-frey Naughton, Building a Scalable Geo-Spatial DBMS: Technology, Im-plementation, and Evaluation, In Proceedings of the ACM SIGMOD Con-ference, 1997, 336–347.

[13] Elena G. Irwin, Nancy E. Bockstael, and Hyun Jin Cho, Measuring andmodeling urban sprawl: Data, scale and spatial dependencies, In the Ur-ban Economics Sessions, 53rd Annual North American Regional ScienceAssociation Meetings of the Regional Science Association International,November 16-18, 2006, Toronto, Canada.

[14] Jiawei Han, Krzysztof Koperski, and Nebojsa Stefanovic, GeoMiner: ASystem Prototype for Spatial Data Mining, In Proceedings of the ACMSIGMOD Conference, 1997, 553–556.

[15] University of Alabama in Huntsville, ADaM 4.0.0 Documentation,http://datamining.itsc.uah.edu/adam/documentation.html, Accessed onSeptember 14, 2008.

[16] Marcelino Pereira Dos Santos Silva, and Gilberto Camara, Remote Sens-ing Image Mining Using Ontologies, Technical Report, DPI/INPE- ImageProcessing Division, National Institute of Space Research, 2005, Brazil.

[17] MapServer documentation manuals, http://mapserver.gis.umn.edu/, Ac-cessed on September 14, 2008.

[18] Oracle Spatial Documentation, http://www.oracle.com/, Accessed onSeptember 14, 2008.

[19] Open Geospatial Consortium, OpenGIS implementation specification:Grid coverage, Technical report, Open Geospatial Consortium, 2001.

[20] Xing Lin, and Timothy H. Keitt, Goeraster a coverage/raster modeland operations for PostGIS, Google Summer of Code Project, 2007,http://lists.refractions.net/, Accessed on October 04, 2008.

[21] Shashi Shekhar, and Sanjay Chawala, Spatial Databases: A Tour, PearsonEducation Inc., 2003, New Jersey, USA.

[22] Vittorio Castelli, and Lawrence D. Bergman, Image Databases: Searchand Retrieval of Digital Imagery, John Wily & Sons Inc., 2002, New York,USA.

104

Bibliography

[23] Liu Yu, Wang Yinghui, Zhang Yi, Lin Xing, and Qin Shi, GSQL-R: A querylanguage supporting raster data, In the Geoscience and Remote SensingSymposium IGARSS ’04’, 2004, 7, 4414–4417.

[24] PgRaster SQL interface requirement document for raster data,http://postgis.refractions.net/, Accessed on October 11, 2008.

[25] PostgreSQL documentation of manuals, http://www.postgresql.org/, Ac-cessed on October 11, 2008.

[26] PGCHIP documentation of manuals, http://simon.benjamin.free.fr/pgchip/,Accessed on October 11, 2008.

[27] Oracle Spatial Documentation, http://www.oracle.com/, Accessed on Oc-tober 11, 2008.

[28] Gilberto Camara, Lubia Vinhas, Karine Reis Ferreira1, Gilberto Ribeiro deQueiroz, Ricardo Cartaxo Modesto de Souza, Antonio Miguel Vieira Mon-teiro, Marcelo Tılio de Carvalho, Marco Antonio Casanova, and UbirajaraMoura de Freitas, TerraLib: An Open Source GIS Library for Large-ScaleEnvironmental and Socio-economic Applications, In the book G. Brent Halland Michael G. Leahy (edt), Open Source Approaches in Spatial Data Han-dling, Springer, 2008, 247–270, Berlin.

[29] TerraLib programming tutorial, http://www.dpi.inpe.br/terralib/docs/,Accessed on October 14, 2008.

[30] Gilberto Camara, Marcos Correa Neves, Antonio Miguel Vieira Monteiroand Lubia Vinhas, Spring and TerraLib: Integrating Spatial Analysis andGIS, Technical Report, DPI/INPE- Image Processing Division, NationalInstitute of Space Research, 2002, Campos do Jordao, SP, Brazil.

[31] Lubia Vinhas, Gilberto Camara, and Ricardo Cartaxo Modesto de Souza,TerraLib: An open source GIS library for spatio-temporal databases, Tech-nical Report, DPI/INPE- Image Processing Division, National Institute ofSpace Research, 2004, Brazil.

[32] Paul Ramsey, The state of open source GIS, Technical Report, RefractionsResearch Inc., 2007.

[33] GDAL – geospatial data abstraction library, http://www.gdal.org/, Ac-cessed on October 17, 2008.

[34] J.A Greenberg, C.A Rueda, and S.L Ustin, Starspan: A tool for fast selec-tive pixel extraction from remotely sensed data, Center for Spatial Tech-nologies and Remote Sensing (CSTARS), University of California at Davis,2005, Davis, CA.

[35] Starspan documentation manuals, http://starspan.casil.ucdavis.edu/,Accessed on October 17, 2008.

105

Bibliography

[36] Renato Martins Assuncao, Marcos Correa Neves, Gilberto Camara, andCorina Da Costa Freitas, Efficient regionalization techniques for socio-economic geographical units using minimum spaning trees, In the Inter-national Journal of Geographical Information Science, 2006, 20, 797–812.

[37] Pedro Ribeiro de Andrade Neto, and Paulo Justiniano Ribeiro Junior, AProcess and Environment for Embedding The R Software into TerraLib,In the VII Brazilian Symposium on Geoinformatics, GeoInfo2005, 2005,Campos do Jordao, SP, Brazil.

[38] John Rushing, Rahul Ramachandran, Udaysankar Nair, Sara Graves, RonWetch, and Hong Lin, ADaM: a data mining toolkit for scientists andengineers, In the Computers & Geosciences, 2005, 31, 607–618.

[39] Lubia Vinhas, Gilberto Camara, and Ricardo Cartaxo Modesto de Souza,TerraLib: An open source GIS library for spatio-temporal databases, Tech-nical Report, DPI/INPE- Image Processing Division, National Institute ofSpace Research, 2004, Brazil.

[40] A.M. MacEachren, An evolving cognitive-semiotic approach to geographicvisualization and knowledge construction, In the Information Design Jour-nal, 2001, 10, 49–72.

[41] David J. Hand, Statistics and Data Mining: Intersecting Disciplines, Inthe SIGKDD Explanations, 1999, 1, 1, 16–19.

[42] S. Sumathi, and S.N. Sivanandam, Statistical Themes and Lessons forData mining, In the Studies in Computational Intelligence (SCI), 2006,29, 243–263.

[43] Surajit Chaudhuri, Data Mining and Database Systems: Where is theIntersection?, In the IEEE Data Eng. Bull., 1998, 21, 1, 4–8.

[44] A.K. Sinha, Geoinformatics: Data to knowledge. The Geological Society ofAmerica (GSA), 2006, Boulder, Colorado, USA.

[45] David M. Rocke, and David L. Woodruff, Some Statistical Tools for DataMining Applications, Center for Image Processing and Integrated Com-puting University of California, 1998, Davis, USA.

[46] Fredrik Farnstrom, James Lewis, and Charles Elkan, Scalability for clus-tering algorithms revisited, In the SIGKDD Explorations, 2000, 2, 7–51.

[47] M.O. Mansur, and Mohd. Noor Md. Sap, Outlier Detection Technique inData Mining: A Research Perspective, In the Postgraduate Annual Re-search Seminar, 2005, Brazil.

[48] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and JoorgSander, LOF: Identifying Density-Based Local Outliers, In the Proceed-ings of ACM SIGMOD 2000 Int. Conf. on Management of Data, 2000, 29,2, 93–104.

106

Bibliography

[49] Tu Bao Ho, Saori Kawasaki1, and Janusz Granat, Knowledge Acquisitionby Machine Learning and Data Mining, In the Studies in ComputationalIntelligence, Springer, 2007, 59, 69–91.

[50] Pauray S.M. Tsai, and Chien-Ming Chen, Mining interesting associationrules from customer databases and transaction databases. In the Informa-tion Systems, 2004, 29, 8, 685–696.

[51] S. Sumathi, and S.N. Sivanandam, Data Mining Tasks, Techniques, andApplications, In the Studies in Computational Intelligence (SCI), 2006, 29,195–216.

[52] Boriana L. Milenova, and Marcos M. Campos, Mining high-DimensionalData for International Fusion: A Database-Centric Approach, In the 8thInternational Conference on Information Fusion, 2005, 1, 7 pp-.

[53] Martin Ester, Alexander Frommelt, Hans-Peter Kriegel, and Joorg Sander,Spatial Data Mining: Database Primitives, Algorithms and EfficientDBMS Support, In the Data Mining and Knowledge Discovery, 2000, 4,193–216.

[54] Ming-Syan Chen, Jiawei Han, and Philip S. Yu, Data Mining: An overviewfrom database prospective, In the IEEE Transactions on Knowledge andData Engineering, 1996, 8, 886–883.

[55] Surajit Chaudhuri, and Umeshwar Dayal, An overview of data warehousing and OLAP technology, In the SIGMOD Record, 1997, 26, 65–74.

[56] P. Adriaans, and D. Zantinge, Data Mining, Addison-Wesley, 1996, Harlow, U.K.

[57] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, and Murali Venkatrao, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, In the Data Mining and Knowledge Discovery, 1997, 1, 29–53.

[58] Monica Wachowicz, GeoInsight: An approach for developing a knowledge construction process based on the integration of GVis and KDD methods, In the book Harvey J. Miller, Jiawei Han (Eds.), Geographic data mining and knowledge discovery, Taylor & Francis Inc., 2001, New York, USA.

[59] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh, Algorithms for association rule mining: a general survey and comparison, In the ACM SIGKDD Explorations, 2000, 2, 1, 58–64.

[60] Tu Bao Ho, Saori Kawasaki, and Janusz Granat, Knowledge Acquisition by Machine Learning and Data Mining, In the Studies in Computational Intelligence, Springer, 2007, 59, 69–91.

[61] S. Sumathi, and S.N. Sivanandam, Data Mining Tasks, Techniques, and Applications, In the Studies in Computational Intelligence (SCI), 2006, 29, 195–216.

[62] Jong Gyu Han, Keun Ho Ryu, Kwang Hoon Chi, and Yeon Kwang Yeon, Statistics Based Predictive Geo-spatial Data Mining: Forest Fire Hazardous Area Mapping Application, In the X. Zhou, Y. Zhang, and M.E. Orlowska (Eds.): APWeb 2003, Springer, 2003, LNCS 2642, 370–381, Berlin, Heidelberg.

[63] Gilberto Camara, Max J. Egenhofer, Frederico Fonseca, and Antonio Miguel Vieira Monteiro, What is an image? In the International Conference on Spatial Information Theory, Springer, 2001, LNCS 2205, 474–488.

[64] Yong Ge, Bai Hexiang, and Sanping Li, Geo-spatial Data Analysis, Quality Assessment, and Visualization, In the Proceedings of International Conference on Computational Science and Applications ICCSA 2008, Springer, 2008, 5072, 258–267.

[65] S. Rinzivillo, F. Turini, V. Bogorny, C. Körner, B. Kuijpers, and M. May, Knowledge Discovery from Geographical Data, In the book Mobility, Data Mining, and Privacy, Springer, 2008, 243–265, Berlin, Heidelberg.

[66] Vijay Gandhi, James M. Kang, and Shashi Shekhar, Spatial Databases: Technical Report, Department of Computer Science and Engineering, University of Minnesota, 2007, USA.

[67] Tianqiang Huang, and Xiaolin Qin, Detecting Outliers in spatial database, In the International Conference on Image and Graphics, IEEE Computer Society, 2004, 556–559.

[68] Elzbieta Malinowski, and Esteban Zimanyi, Spatial Data Warehouses, In the book Advanced Data Warehouse Design, Springer, 2008, 133–179, Berlin, Heidelberg.

[69] Sebastien Mustiere, and John Van Smaalen, Database requirements for generalization and multiple representation, In the book Generalization of geographic information: cartographic modelling and applications, Springer, 2007, 113–136, Berlin, Heidelberg.

[70] Elisa Bertino, and Maria Luisa Damiani, Spatial Knowledge-Based Applications and Technologies: Research Issues, In the Proceedings of 9th International Conference, KES 2005, Part IV, Springer, 2005, LNCS 3684, 324–328, Berlin, Heidelberg.

[71] Krzysztof Koperski, and Jiawei Han, Discovery of spatial association rules in geographic information databases, In the Proc. 4th Int. Symp. Advances in Spatial Databases, SSD, Springer-Verlag, 1995, LNCS 951, 47–66, Berlin, Heidelberg.

[72] Alfred Stein, Modern developments in image mining, In the Science in China Series E: Technological Sciences, 2008, 51, 13–25.

[73] Ranga Raju Vatsavai, Shashi Shekhar, Thomas E. Burk, and Budhendra Bhaduri, *Miner: A Suite of Classifiers for Spatial, Temporal, Ancillary, and Remote Sensing Data Mining, In the Fifth International Conference on Information Technology: New Generations, IEEE, 2008, 801–806.

[74] Marcelino Pereira S. Silva, Gilberto Camara, Ricardo Cartaxo M. Souza, Dalton M. Valeriano, and Maria Isabel S. Escada, Mining Patterns of Change in Remote Sensing Image Databases, In the IEEE International Conference on Data Mining, IEEE, 2005, 362–369, Los Alamitos, CA, USA.

[75] R.L. Kettig, and D.A. Landgrebe, Computer classification of remotely sensed multispectral image data by extraction and classification of homogeneous objects, In the IEEE Trans. Geoscience Electronics, 1976, GE-14, 1, 19–26.

[76] B. Uma Shankar, Novel Classification and Segmentation Techniques with Application to Remotely Sensed Images, In J.F. Peters et al. (Eds.): Transactions on Rough Sets VII, Springer-Verlag, 2007, LNCS 4400, 295–380, Berlin, Heidelberg.

[77] Wynne Hsu, Mong Li Lee, and Ji Zhang, Image Mining: Trends and Developments, In the Journal of Intelligent Information Systems, 2002, 19, 1, 7–23.

[78] Gilberto Camara, Ricardo Cartaxo Modesto Souza, Ubirajara Moura Freitas, Juan Garrido, and Fernando Mitsuo, SPRING: Integrating remote sensing and GIS by object-oriented data modelling, In the Journal of Computers & Graphics, 1996, 20, 3, 395–403.

[79] Gilberto Camara, Lubia Vinhas, Karine Reis Ferreira, Gilberto Ribeiro de Queiroz, Ricardo Cartaxo Modesto de Souza, Antonio Miguel Vieira Monteiro, Marcelo Tílio de Carvalho, Marco Antonio Casanova, and Ubirajara Moura de Freitas, TerraLib: An Open Source GIS Library for Large-Scale Environmental and Socio-economic Applications, In the book G. Brent Hall and Michael G. Leahy (Eds.), Open Source Approaches in Spatial Data Handling, Springer, 2008, 247–270, Berlin.

[80] Erich Gamma, Richard Helm, Ralph Johnson, and John M. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995, NJ.

[81] J. Rogan, and J. Miller, Integrating GIS and remote sensing for mapping forest distribution and change, In the book Understanding forest distribution and spatial pattern: Remote sensing and GIS approaches, CRC Press (Taylor & Francis), 2006, FL, USA.

[82] A.S.M. Gieske, J. Hendrikse, V. Retsios, B. van Leeuwen, B.H.P. Maathuis, M. Romaguera, J.A. Sobrino, W.J. Timmermans, and Z. Su, Processing of MSG-1 SEVIRI data in the thermal infrared: algorithm development with the use of the SPARC2004 data set, In the ESA Proceedings WPP-250: SPARC final workshop, 2005, 8, Enschede, Netherlands.

[83] E. Ebert, A pattern recognition technique for distinguishing surface and cloud types in the polar regions, In the Journal of Climate and Applied Meteorology, 1987, 26, 1412–1427.

[84] U. Amato, A. Antoniadis, V. Cuomo, L. Cutillo, M. Franzese, L. Murino, and C. Serio, Statistical cloud detection from SEVIRI multispectral images, In the Journal of Remote Sensing of Environment, 2008, 112, 750–766.

[85] Cristina Conde, Antonio Ruiz, and Enrique Cabello, PCA vs. low resolution images in face verification, In the Proceedings of the 12th International Conference on Image Analysis and Processing, IEEE Computer Society, 2003, 63–67.
