Page 1: A clustering-based visualization of spatial patternspages.univ-nc.nc/~selmaoui/Research/rr09-ClusteringVis.pdf · GIS The visualization of data mining results is essential to have

A clustering-based visualization of spatial patterns

Nazha Selmaoui-Folcher, Frederic Flouvat, Elise Desmier and Dominique Gay

University of New Caledonia, PPME-ERIM, F-98851, Noumea, New Caledonia

[email protected], [email protected],

[email protected], [email protected]

March 9, 2010

Abstract

Extraction of interesting colocations in geo-referenced data is one of the major tasks in spatial pattern mining. Given a set of spatial Boolean features, the goal is to find relevant subsets of features associated with objects that are often located together. In this context, the main drawback is the interpretation of extracted patterns by domain experts. Indeed, the common textual representation of colocations loses important spatial information. To overcome this problem, we propose a new clustering-based visualization technique deeply integrated in the colocation algorithm. This new simple, concise and intuitive cartographic representation considers both spatial information and experts' practice. The whole process has been tested on a real-world geological data set, and the added value of the method has been confirmed by domain experts.

1 Introduction

Spatial data mining refers to the extraction of interesting, useful, unexpected and implicit knowledge in spatial data. It has wide applications in environmental management, public safety, transportation or tourism. One of the classical tasks in spatial pattern mining is the extraction of interesting colocations in geo-referenced data [14, 20, 3, 12, 16, 4, 21, 6, 7, 17]. To deal with this problem, two families of spatial pattern mining approaches may be identified: multi-relational approaches and colocation-based approaches. When spatial data is made of various tables describing objects and spatial relationships between objects, multi-relational data mining techniques may be applied to extract interesting spatial patterns such as association rules [16] or emerging patterns [6]; see also [3, 17]. On the other hand, [12] identified two approaches for colocation mining: transaction-based approaches and event-based approaches. Transaction-based approaches focus on transforming spatial data into transactional data on which classical itemset mining algorithms can be used [14, 4]. In [14], the authors presented an efficient method for mining association rules in geographic information databases. This method enumerates neighbors to "materialize" a set of transactions around instances of the reference spatial feature. The goal is to find colocations of features relevant to the reference feature. [4] extends this work by introducing knowledge constraints in a preprocessing step. The main limitation of these works is that spatial relationships and features are only partially considered. To cope with this limitation, event-based approaches focus on the events and their neighbor relationships [20, 12, 21]. Shekhar et al. defined the colocation concept based on Koperski's work. The goal is to find all subsets of spatial features likely to occur together. To filter interesting colocations, two interestingness measures have been proposed. Thanks to the anti-monotonic property of these predicates, a levelwise algorithm can be used to extract interesting colocations. Thus, this approach considers all the features together, and the original data are not transformed.

However, a major problem with these spatial pattern mining techniques is the interpretation of the results by domain experts. Extracted patterns are presented in a textual form, which experts cannot easily understand or use directly. Moreover, a textual representation only partially reflects the spatial information of the underlying objects. Indeed, experts can only know which features are generally located together, but they have no information on where these colocations are generally located or on their configuration. In this context, we propose a new visualization of colocations based on clustering. This solution leads to a simple, concise and intuitive cartographic visualization of colocations, and takes into consideration the spatial nature of the underlying objects and the experts' practice. Finally, this proposition has been integrated in a prototype coupled with a Geographic Information System (GIS). Experiments have been carried out on a real geological dataset and validated by a domain expert.

Section 2 presents related works on the interpretation and the visualization of data mining results. Section 3 presents the colocation mining problem. Section 4 introduces our work to deliver actionable knowledge to domain experts, i.e. a new spatial representation of colocations. Section 5 presents some experiments on a real geological dataset. Finally, section 6 concludes and gives some perspectives.

2 Visualization in data mining: related works

One of the major issues in data mining is the representation of the discovered knowledge so that it can be easily understood and directly used by experts [11]. Nevertheless, most data mining methods return results in a textual form based on an interestingness measure. To our knowledge, no solution has been proposed specifically for the visualization of colocation patterns. However, several visualization systems have been proposed for classical data mining tasks or for spatial data. In the rest of this section, we describe the main approaches.

For classical data mining, several systems have been developed to represent raw data or mining results [5, 13]. For example, MineSet [5] is an interactive system for data mining integrating data visualization. Different kinds of visualizers (statistics, scatter, map, tree) are available according to the type of result to visualize. Each mining algorithm (such as the simple-Bayes model, decision trees, or association rules) is coupled with a visualization tool in order to help users interpret the learned models. Figure 1-a shows the visualization of association rules.

Figure 1: a) A Rule visualizer view of supermarket items. b) Visualization of an association rule for South America. c) A snapshot of the proposed WiFIsViz.

More recently, [15] addressed the visualization of frequent itemsets. The authors developed a system, WiFIsViz, for visualizing frequent itemsets based on orthogonal graphs (wiring-type diagrams). Frequent itemsets are shown in a two-dimensional space, where the x-axis shows items and the y-axis shows frequencies (figure 1-c). An itemset X is represented by a horizontal line connecting nodes, where each node represents an item of X. Moreover, itemsets sharing the same prefix are merged, which improves the visualization. The visualizer provides different levels of detail to represent frequent itemsets. It also integrates features for constrained itemset mining.

For spatial data, a typical system is the one proposed in [2]. The authors were interested in representing spatial data and in providing a visualization of classical data mining results on spatial data. Subgroups or clusters are naturally presented on maps by coloring spatial objects or placing icons on them. The same technique is applied to decision trees and classification rules by associating a visual feature with objects. For non-geographical information such as mined trees or rules, the system makes a dynamic link between the map and reports. For example, when the cursor is positioned on a tree node or a rule in a report, the corresponding instances are highlighted in the map (and vice versa). Figure 1-b illustrates an application of this system to the visualization of rules for South America.

As far as we know, none of the solutions proposed in the literature was designed to display spatial patterns in a simple, concise and intuitive way for experts. They do not take into consideration the spatial nature of the underlying objects, and only provide non-spatial knowledge.

3 The colocation mining framework

This section recalls the colocation framework proposed in [20, 12, 21]. Let F be a set of boolean features, O be a set of spatial objects, and R be a neighbor relationship over O. An instance of a feature f ∈ F is an object of O having the feature f. We define the function Θ : O → F to formally define the association between objects and features. For example, in figure 2, F = {A, B, C, D, E}, O = {A1, C2, B3, ..., E12}, and A9 is an instance of feature A, i.e. Θ(A9) = A. Note that, in this paper, spatial objects are represented by points in a two-dimensional space.

Figure 2: Spatial objects and their features

A colocation C ⊆ F is a set of features whose instances form a clique w.r.t. a neighbor relationship R. If the neighbor relationship R is the Euclidean distance with a threshold, two spatial objects are neighbors if their distance is lower than the given threshold.

A colocation instance I ⊆ O of a colocation C is a set of objects such that the objects are instances of all the features of C and form a clique w.r.t. R. As a consequence, a colocation instance of a colocation C satisfies the following properties:

• |{f ∈ C | ∃ o ∈ I with Θ(o) = f}| = |C|

• |I| = |C|

• ∀ o, p ∈ I, R(o, p) = true

Figure 2 shows that the set of objects {A9, B4, D10} is a colocation instance of the colocation {A, B, D} w.r.t. a fixed Euclidean distance threshold (represented by dotted circles). On the contrary, {A1, B4, C7}, {A1, B4, D10} or {A9, B4} are not colocation instances of {A, B, D}. However, not every colocation is interesting. Here, there is only one set of three neighbor objects having the features A, B and D. Thus, we need other concepts to determine the interestingness of a colocation. In this paper, to simplify, we use the term "instance" to refer to a "colocation instance", and represent the colocation {A, B, D} by ABD in the figures.

A table instance of a colocation C, denoted TI_C, is the set of all its colocation instances. The table instance of {A, B, C} is TI_{A,B,C} = {{A1, B8, C7}, {A5, B6, C2}} and the table instance of {B, D} is TI_{B,D} = {{B4, D10}} (see figure 2). More formally, we have

$$TI_C = \{ I \subseteq O \mid I \text{ is an instance of } C \text{ w.r.t. } R \}$$
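The definitions above can be illustrated with a minimal Python sketch that enumerates a table instance by brute force. The function name `table_instance`, the object names and the coordinates are hypothetical (they are not those of figure 2), and the neighbor relationship is assumed to be a Euclidean distance threshold.

```python
from itertools import product
from math import dist

def table_instance(objects, colocation, threshold):
    """Return all colocation instances of `colocation` (brute force).

    objects: dict mapping an object name to (feature, (x, y)).
    colocation: iterable of features.
    threshold: two objects are neighbors if their distance is below it.
    """
    features = sorted(colocation)
    # Group object names by feature, for the features of the colocation only.
    by_feature = {f: [n for n, (g, _) in objects.items() if g == f]
                  for f in features}
    instances = []
    # One object per feature, so |I| = |C| holds by construction.
    for combo in product(*(by_feature[f] for f in features)):
        points = [objects[n][1] for n in combo]
        # Clique test: every pair of objects must be neighbors.
        if all(dist(p, q) < threshold
               for i, p in enumerate(points) for q in points[i + 1:]):
            instances.append(frozenset(combo))
    return instances

# Hypothetical coordinates, chosen only for illustration.
objects = {
    "A1": ("A", (0.0, 0.0)),
    "B1": ("B", (1.0, 0.0)),
    "D1": ("D", (0.5, 0.8)),
    "A2": ("A", (5.0, 5.0)),
    "B2": ("B", (5.5, 5.0)),
    "D2": ("D", (9.0, 9.0)),
}
ti_abd = table_instance(objects, {"A", "B", "D"}, threshold=1.5)
ti_ab = table_instance(objects, {"A", "B"}, threshold=1.5)
```

With these made-up points, {A, B, D} has a single instance while {A, B} has two. A real implementation would use a spatial index rather than testing every combination.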

The participation ratio pr(C, f) for a feature f of a colocation C is the ratio of the number of objects of feature f included in the instances of C to the total number of objects of feature f.

$$pr(C, f) = \frac{|\{ o \in I \mid I \in TI_C \text{ and } \Theta(o) = f \}|}{|TI_{\{f\}}|}$$

In figure 2, pr({A, B, C}, A) = 2/3, pr({A, B, C}, B) = 1/2 and pr({A, B, C}, C) = 1.

Based on the definitions above, [12] has proposed the concept of participation index, denoted pi(C), to estimate the frequency of colocation C. More precisely, it represents the minimal probability to have an object in an instance of the colocation C w.r.t. all objects having this feature.

$$pi(C) = \min_{f \in C} pr(C, f)$$
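As a worked example, the following Python sketch recomputes the participation ratios quoted above for {A, B, C} and derives its participation index. The table instance is the one given in the text; the per-feature object counts (3 objects of A, 4 of B, 2 of C) are an assumption inferred from the denominators of the stated ratios, and the helper names are hypothetical.

```python
from fractions import Fraction

def feature_of(name):
    # Naming convention of figure 2: the leading letter is the feature.
    return name[0]

def participation_ratio(table_instance, feature, feature_counts):
    # Distinct objects of `feature` that appear in at least one instance.
    involved = {o for inst in table_instance for o in inst
                if feature_of(o) == feature}
    return Fraction(len(involved), feature_counts[feature])

def participation_index(table_instance, colocation, feature_counts):
    # Minimum participation ratio over the features of the colocation.
    return min(participation_ratio(table_instance, f, feature_counts)
               for f in colocation)

ti_abc = [{"A1", "B8", "C7"}, {"A5", "B6", "C2"}]
counts = {"A": 3, "B": 4, "C": 2}  # assumed totals, inferred from the text
pr_values = {f: participation_ratio(ti_abc, f, counts) for f in "ABC"}
pi_abc = participation_index(ti_abc, "ABC", counts)
print(pr_values, pi_abc)
```

Under these assumptions the ratios are 2/3, 1/2 and 1, so pi({A, B, C}) = 1/2, and {A, B, C} would be reported as interesting for any threshold α ≤ 0.5.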

Based on these definitions, the problem to solve is:

Colocation mining problem. Given F a set of features, O a set of spatial objects, R a neighbor relationship and α ∈ [0, 1] a threshold, the problem is to find the set of colocations {C ⊆ F | pi(C) ≥ α}.

4 A spatial visualization of colocations integrated in a GIS

The visualization of data mining results is essential to obtain actionable domain knowledge. In domains that handle geographical data, GIS are the classical tools for storing and visualizing spatial data. A main characteristic of a GIS is the cartographic visualization of information in thematic layers. In this context, our objective is to find a spatial visualization of the colocation mining results integrated in the GIS. However, the potentially high number of colocation instances may lead to an unreadable map, and colocations in a textual form lose the spatial information of their objects.

To deal with these problems, we propose a new cartographic visualization of colocations in a GIS. Our approach (figure 3) has two steps:

a) extract colocation patterns using a classical colocation mining algorithm

b) use the table instance of each colocation C to construct spatial representations of C

These spatial representations make it possible to see where and how the colocation is generally located. Basically, a spatial representation of a colocation C is a set of points, each one representing a feature of C, linked together by lines. The lines between the points represent the neighbor relationship. In other words, a spatial representation of a colocation is a clique spatially positioned on a map. The position of each point of the clique depends on the position of the colocation instances. For example, in figure 3, the colocation {E, C} has two spatial representations, which shows that this colocation is generally located in the center and in the north-east of the area.

Figure 3: Principle of our approach

Note that our visualization approach also integrates thematic aspects by drawing each feature in the color of the corresponding theme. In the same way, the intensity of the link color is proportional to the value of the participation index associated with the colocation.

4.1 A first cartographic representation of colocations

Firstly, we consider a very simple approach to construct spatial representations of colocations. It consists in constructing, for each feature f of a colocation C, the centroid of the objects of feature f included in an instance of C (figure 4). In other words, for each feature f ∈ C, the visual representation of f is a spatial object o_f such that:

$$o_f = (x_f, y_f), \quad \text{with } x_f = \frac{\sum_{o=(x,y) \in \Omega_{f,C}} x}{|\Omega_{f,C}|}, \quad y_f = \frac{\sum_{o=(x,y) \in \Omega_{f,C}} y}{|\Omega_{f,C}|} \quad \text{and } \Theta(o_f) = f$$

where $\Omega_{f,C} = \{ o \in I \mid I \in TI_C \text{ and } \Theta(o) = f \}$.

As a consequence, each feature f of a colocation C is represented by a single spatial object (i.e. a point) in the map. This object corresponds to the "average" location of feature f in the instances of C. Thus, each colocation is represented by a single clique corresponding to the "average" location and configuration of its instances. Figure 4 illustrates the construction of the spatial representations based on the centroid approach (step b of our approach illustrated in figure 3).
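The centroid construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `centroid_representation`, the object names and the coordinates are hypothetical, and the feature of an object is assumed to be the first letter of its name.

```python
from collections import defaultdict

def feature_of(name):
    # Assumed naming convention: the leading letter is the feature.
    return name[0]

def centroid_representation(table_instance, positions):
    """Return {feature: (x_f, y_f)} for one colocation."""
    # Omega_{f,C}: distinct objects of feature f appearing in the instances.
    omega = defaultdict(set)
    for instance in table_instance:
        for obj in instance:
            omega[feature_of(obj)].add(obj)
    representation = {}
    for f, objs in omega.items():
        xs = [positions[o][0] for o in objs]
        ys = [positions[o][1] for o in objs]
        representation[f] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return representation

# Hypothetical instances of {E, C}: two near the origin, one far away.
positions = {"E1": (0.0, 0.0), "E2": (2.0, 0.0), "E3": (10.0, 12.0),
             "C1": (1.0, 1.0), "C2": (11.0, 11.0)}
ti_ec = [("E1", "C1"), ("E2", "C1"), ("E3", "C2")]
rep = centroid_representation(ti_ec, positions)
```

Note that C1 is counted once even though it appears in two instances, because Ω is a set of objects. With these made-up points, both centroids fall between the two groups of instances, in a zone where no instance lies.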

However, this method leads to an interpretation problem when the spatial representation of the colocation is located in the middle of the studied area. Indeed, instances of such colocations can be either located in the middle of the area or uniformly distributed all over the area. Moreover, in practice, instances of a colocation are rarely grouped in a single location. Instead, there may be several locations where the colocation frequently appears. In such cases, this method will construct an "average" spatial representation which is not necessarily meaningful for the expert. Figures 3 and 4 illustrate this problem with colocation {E, C}. In figure 3 (top figure), instances of this colocation are frequently located in the central region and in the north-east region. In figure 4 (right figure), its spatial representation using the previous method is located between these two regions, which can be misinterpreted by experts. Nonetheless, note that this spatial representation may still give interesting information to experts: this colocation generally does not occur in the south or in the east of the studied area.

A solution to deal with this problem is to use clustering in order to have several spatial representations of a colocation w.r.t. the locations of its objects, and thus to have a finer interpretation of the spatial distribution of colocations.

4.2 A clustering-based spatial representation of colocations

When instances of a colocation are not located in a single place, there should be several spatial representations of this colocation to represent these different spatial distributions. In this context, our proposition is to combine a clustering method with the colocation mining algorithm (figure 5). More precisely, instead of computing centroids based on the whole table instance of a colocation C, we partition this table instance into several clusters based on the spatial coordinates of its objects (figure 5, step 1.clustering). Then, each partition (representing a typical location of instances of C) is used to construct a spatial representation of C based on the centroid method described in the previous subsection (figure 5, step 2.centroids).

Algorithm 1 gives the details of our method. The main part of the algorithm corresponds to the levelwise colocation mining algorithm proposed in [12]; only lines 2-4 and 9-11 correspond to the construction of the spatial clustering-based colocation patterns.

Algorithm 1 Spatial clustering-based colocation mining algorithm
Require: a set of spatial objects O, a set of features F, a boolean spatial relationship R, the participation index threshold α
Ensure: the spatial representations of interesting colocations
 1: Cand_1 = F; k = 1
 2: for all f ∈ F do
 3:     CF_f = clusterObjectsFeature(O, f)
 4: end for
 5: while Cand_k ≠ ∅ do
 6:     for all C ∈ Cand_k do
 7:         TI_C = generateTableInstance(O, C)
 8:         if pi(C) ≥ α then
 9:             for all cluster ∈ clusterTIColoc(TI_C, ⋃_{f ∈ C} CF_f) do
10:                 Spatial_Coloc_C = generateCentroidsColoc(cluster, C)
11:             end for
12:             Interest_Coloc_k = Interest_Coloc_k ∪ {C}
13:         end if
14:     end for
15:     Cand_{k+1} = {X ⊆ F | ∀ Y ⊂ X, Y ∈ Interest_Coloc_k} \ ⋃_{j ≤ k} Cand_j
16:     k = k + 1
17: end while
18: Return ⋃_{C ∈ ⋃_{0<i<k} Interest_Coloc_i} Spatial_Coloc_C

Figure 4: Simple representation using centroids of table instance objects

Figure 5: Clustering-based representation using table instance objects

The levelwise strategy proposed in [20] for colocation mining is based on the classical Apriori algorithm [1]. Note that a generalization of the Apriori algorithm is also described in [18]. The principle of this strategy is to iteratively generate a set of candidate colocations of size k + 1 (i.e. colocations having k + 1 features), denoted Cand_{k+1}, from the set of interesting colocations of size k, denoted Interest_Coloc_k, and to test their participation index. Thus, this approach alternates candidate generation and evaluation phases. The candidate generation is done in line 15, based on the interesting colocations of size k. For each candidate colocation generated, the evaluation phase is done in line 8, using the table instance computed in line 7.
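The candidate generation step can be sketched as follows. This is a naive Python illustration, not the optimized Apriori join: it enumerates every feature set of size k + 1 and keeps those whose subsets of size k are all interesting. The function name `generate_candidates` and the example data are hypothetical.

```python
from itertools import combinations

def generate_candidates(interesting_k, k):
    """Candidates of size k + 1 whose size-k subsets are all interesting."""
    interesting_k = {frozenset(c) for c in interesting_k}
    # Only features occurring in an interesting colocation can be extended.
    features = sorted(set().union(*interesting_k)) if interesting_k else []
    candidates = set()
    for combo in combinations(features, k + 1):
        # Pruning justified by the anti-monotonicity of the participation index.
        if all(frozenset(y) in interesting_k for y in combinations(combo, k)):
            candidates.add(frozenset(combo))
    return candidates

# Hypothetical interesting colocations of size 2.
interesting_2 = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"B", "D"}]
cand_3 = generate_candidates(interesting_2, k=2)
```

Here only {A, B, C} survives: {A, B, D} is discarded because {A, D} is not interesting, so its table instance never has to be computed.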

For the construction of the spatial representations, the simplest solution would have been to execute a clustering algorithm on the table instance of each colocation. However, this solution would have been time consuming, since there may be thousands of colocations. Therefore, we developed a two-step clustering approach integrated in the mining algorithm. The two steps are:

• a clustering of the objects of each feature (lines 2 to 4), run once at the beginning of the algorithm. Let CF_f be the set of clusters obtained with the objects of feature f.

• a clustering of each colocation table instance based on the clusters of each feature, using a merge and split approach (line 9).

First, for each feature f, we partition the objects having feature f based on their coordinates (lines 2 to 4), using the X-means clustering algorithm [19] implemented in Weka [10]. Then, we use these clusters of objects as a basis for the clustering of each table instance of an interesting colocation C (function clusterTIColoc, line 9). Finally, for each cluster of instances generated, the function generateCentroidsColoc (line 10) constructs the corresponding spatial representation of C based on the centroids of the objects of each feature. This approach is illustrated for one interesting colocation in the example of figure 6.

More precisely, the function clusterTIColoc processes the table instance using a merge and split approach. The principle of this method is to select the feature f having the highest number of clusters, and to split the instances of C w.r.t. the clusters of f. However, two instances in two different partitions w.r.t. the clusters of f can have in common an object of another feature of C. In those cases, we have conflicting clusters, i.e. objects belonging to several partitions. For example, in figure 6-a, if we partition colocation instances w.r.t. clusters of Z, the second and third instances of {X, Y, Z} would be in different clusters, whereas they share the object Y2.

Figure 6: Example of construction of the visual representations of colocation {X, Y, Z} using the merge and split approach

To avoid this problem, our solution is to iteratively merge the clusters of the feature having the highest number of clusters, and finally split the table instance w.r.t. these clusters when nothing can be merged anymore (stability condition). This method is illustrated in figure 6. Given two features f and g for a colocation C, we have two situations:

• suppose that two instances are in different partitions w.r.t. clusters of f, but have in common an object of g. We merge the two clusters of f leading to such a partitioning (as a consequence, these clusters are no longer conflicting). For example, in figure 6-c, the second and third instances belong to different partitions w.r.t. clusters of X, but they have in common the object Y2. Consequently, the two conflicting clusters of X are merged (figure 6-d).

• suppose that two instances are in different partitions w.r.t. clusters of f, but have objects belonging to the same cluster of g. We split the corresponding cluster of g. For example (figure 6-e), the fourth and fifth instances belong to different partitions w.r.t. clusters of Y, but they include the objects X4 and X5, which are in the same cluster of X. Consequently, we split the corresponding cluster of X w.r.t. clusters of Y (figure 6-f).

Note that the interpretation may be difficult if a lot of spatial patterns are generated. The zoom functionality of the GIS partially solves this problem, but in some cases it may not be enough. To deal with this problem, the user can choose to extract a condensed representation of the interesting colocations, i.e. a subset of colocations representing the solutions. Thus, our system also proposes the extraction and visualization of maximal interesting colocations w.r.t. set inclusion (also called the positive border in [18]), instead of all interesting colocations (see [18, 9] for more details).
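For illustration, maximal colocations w.r.t. set inclusion can be filtered with a short Python sketch. The function name and the example colocations are hypothetical, and the quadratic scan shown here is only meant to make the definition concrete, not to reflect how the prototype computes the positive border.

```python
def maximal_colocations(colocations):
    """Keep the colocations that are not strictly included in another one."""
    sets = [frozenset(c) for c in colocations]
    return [c for c in sets if not any(c < other for other in sets)]

# Hypothetical set of interesting colocations.
interesting = [{"A"}, {"B"}, {"A", "B"}, {"A", "B", "C"}, {"B", "D"}]
maximal = maximal_colocations(interesting)
```

Only {A, B, C} and {B, D} remain, so the map would show the representations of two colocations instead of five.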

4.3 Advantages of our proposition

This visualization approach has three main advantages w.r.t. existing solutions. Firstly, we get a spatial visualization of colocations totally integrated in the GIS, and thus adapted to experts' needs and practices. The original data is not affected by our approach: only an additional layer is added. Moreover, it can take advantage of the GIS functionalities. For example, the user can zoom on the map in order to have either a general view of all the colocations (figure 9, in the middle), or a detailed view of one or several colocations (figure 9, on the right).

Secondly, this representation gives additional information on the colocations. It makes it possible to visualize where and how an interesting colocation is spatially located. For example, figure 5 (right side) shows that colocation {A, B, C} is generally located in the north-west of the map, {A, B, E, F} in the south-west and {A, B, D} in the south-east. Thus, our approach has the advantage of providing experts with a global picture of the spatial distribution of the colocations. Using a classical visualization approach, it would have been more difficult to obtain such information. Moreover, our approach also makes it possible to visualize precisely how the features of a colocation are positioned w.r.t. each other. For example, figure 3 shows that the instances of colocation {A, B, D} are generally closer to each other than the ones of colocation {A, B, C}. In the same way, the spatial representation of {A, B, D} shows that objects having feature B are generally below the ones having features A and D, and objects having feature D are generally located on the left of the ones having feature A. Furthermore, note that experts can easily visualize the importance of a colocation and its themes thanks to the color system.

Finally, this visualization approach does not require any additional post-processing step, since it is done during the mining algorithm using the table instances computed for colocation mining.

5 Application

The proposals discussed in this paper have been integrated in a prototype coupled with a GIS (figure 7). This prototype is based on a data mining tool called iZi [8]. This tool is used to solve interesting pattern mining problems as defined in the formal framework of [18], by providing generic algorithm implementations. This tool has been extended to compute spatial clustering-based colocation patterns and to store them in a PostGIS geographical database. Quantum GIS (a free desktop application framework) is used as an interface to visualize data and colocations stored in the GIS.

Figure 7: Architecture of the prototype

We used our prototype to study soil erosion on a mountainous area of 9 km² in New Caledonia. In this area, natural erosion takes place as well as erosion related to mining activities. To study soil erosion, three important thematic layers were considered: soil erosion (6 features), nature of the ground (13 features), and vegetation (13 features). This dataset is composed of more than 9000 objects. The studied objects come from vector data of a geographical database. The spatial relationship studied was a neighbor relationship based on a distance threshold between the centroids of the areas.

Figure 8 shows the performance of colocation mining (V0), colocation mining with the centroid visualization (Centroids) and colocation mining with clustering-based visualization (Clustering). As shown by this figure, performance is acceptable for experts (same order of magnitude) w.r.t. the added information provided, especially considering that such data is rarely updated. Most of the additional processing time is due to the non-optimized implementation of our prototype. Indeed, in this first work, we focus more on results, to demonstrate the interest of this approach, than on performance. For example, the top plot (at a minimum participation index of 0.5) shows the cost of the Weka invocation for the first clustering on features. Indeed, part of the runtime is due to external calls to Weka using intermediate files. In the bottom plot, the difference between V0 and Centroids shows that most of the processing time is not due to the clustering steps, but to the data access in the GIS. SQL queries and database parameters are not optimized in this version of our prototype.

Table 1 shows the number of colocation for dif-ferent distance and participation index thresholds,

Figure 8: Performances of the different approaches (two panels, minimum distance 200m and minimum distance 300m; x-axis: minimum participation index, from 0 to 0.5; y-axis: total time in seconds, ticks at 100, 1000 and 10000; curves: V0, Centroids, Clustering)

Distance  Measure                                   Participation index threshold
                                                    0.5          0.3          0.1
200m      nb colocations                            21           68           266
          avg nb instances for a colocation         16 478       11 974       8 365
          total nb instances for all colocations    346 046      814 263      2 225 118
          nb spatial representations                31           112          510
300m      nb colocations                            55           163          711
          avg nb instances for a colocation         50 803       78 347       87 100
          total nb instances for all colocations    2 794 205    12 770 670   61 928 727
          nb spatial representations                84           258          1349

Table 1: Number of colocations and spatial clustering-based patterns


and the corresponding number of spatial clustering-based colocation patterns. On average, the number of spatial representations is no more than twice the number of colocations. The average number of instances for a colocation represents the number of patterns that would have been displayed on the map using a classical visualization approach such as in [2], i.e. selection of a colocation in a report and display of the corresponding instances on the map. The total number of instances for all colocations represents the number of patterns on the map if we display all the instances of all interesting colocations at the same time. These two indicators illustrate the interest of our approach, since the number of patterns displayed using our solution is much lower than with the two others.
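The comparison made in this paragraph can be checked directly from the figures of Table 1. The short sketch below is only a worked calculation: the numbers are copied from Table 1, while the variable and function names are illustrative choices. For each setting it computes the number of spatial representations per colocation and the ratio between the total number of instances and the number of spatial representations.

```python
# (distance, participation index threshold) ->
# (nb colocations, avg nb instances per colocation,
#  total nb instances, nb spatial representations), copied from Table 1.
table1 = {
    ("200m", 0.5): (21, 16478, 346046, 31),
    ("200m", 0.3): (68, 11974, 814263, 112),
    ("200m", 0.1): (266, 8365, 2225118, 510),
    ("300m", 0.5): (55, 50803, 2794205, 84),
    ("300m", 0.3): (163, 78347, 12770670, 258),
    ("300m", 0.1): (711, 87100, 61928727, 1349),
}


def summarize(table):
    """Return, per setting, (representations per colocation, reduction factor)."""
    summary = {}
    for key, (n_coloc, _avg_inst, total_inst, n_repr) in table.items():
        repr_per_coloc = n_repr / n_coloc   # stays below 2 in every setting
        reduction = total_inst / n_repr     # instances replaced by one representation
        summary[key] = (repr_per_coloc, reduction)
    return summary


summary = summarize(table1)
for (distance, threshold), (per_coloc, reduction) in sorted(summary.items()):
    print(f"{distance} pi>={threshold}: {per_coloc:.2f} representations per colocation, "
          f"about {reduction:,.0f} instances per representation")
```

In every setting the first ratio stays below 2, while each spatial representation stands for thousands of individual instances.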

The visualization of the spatial clustering-based representations for one of these experiments is presented in Figure 9. We can see the spatial objects (left screenshot), their corresponding spatial clustering-based colocations (center screenshot), and a zoom on a specific area (right screenshot). This figure illustrates the advantage of our approach, which provides experts with a global picture of where and how the colocations are generally located. It also shows how experts can use the zoom functionality of the GIS to get a finer view of a specific area.

These results were analyzed and validated by a geologist specializing in soil erosion in New Caledonia. They point out known correlations about soil erosion in this area. The most significant colocations are the associations between sensitive trails, mining zones, river erosion and sparse vegetation, and between mines, hillslope erosion, woody-herbaceous scrub and sensitive trails or river erosion. They highlight the environmental damage near the areas where humans have used the soils. Another example is that colocations show that plant systems can also be related to environmental degradation. The interest of this approach for the experts is to have a formal and intuitive way to study such phenomena, to automate the analysis and to quantify the importance of the correlations thanks to the participation index.
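To make the role of the participation index concrete, here is a minimal sketch of how it is commonly computed in the colocation literature [12]: for each feature of a colocation, take the fraction of that feature's objects that appear in at least one instance of the colocation, then keep the minimum of these fractions. The function name, the data layout and the sample values are assumptions made for illustration; they are not taken from the prototype or from the soil erosion dataset.

```python
def participation_index(colocation, instances, feature_counts):
    """Participation index of a colocation.

    colocation: tuple of feature names, e.g. ("mine", "trail").
    instances: list of tuples of object ids, aligned with `colocation`.
    feature_counts: total number of objects of each feature in the dataset.
    """
    ratios = []
    for position, feature in enumerate(colocation):
        # Distinct objects of this feature that take part in some instance.
        participating = {instance[position] for instance in instances}
        ratios.append(len(participating) / feature_counts[feature])
    return min(ratios)


# Hypothetical example: 4 mines and 2 trails in total, 3 colocation instances.
colocation = ("mine", "trail")
instances = [("m1", "t1"), ("m1", "t2"), ("m2", "t2")]
feature_counts = {"mine": 4, "trail": 2}

pi = participation_index(colocation, instances, feature_counts)
```

Here two of the four mines and both trails participate, so the participation ratios are 0.5 and 1.0 and the index is their minimum, 0.5. A colocation is kept when this value reaches the chosen minimum participation index threshold.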

6 Conclusion

In this paper, we propose a clustering-based method for the visualization of colocation patterns. The visualization method extends the colocation concept with spatial information and is deeply integrated in the colocation mining algorithm. Moreover, the cartographic representation of these patterns fits better with experts' practice. The whole process has been successfully integrated in a prototype based on PostGIS. To our knowledge, existing visualization approaches do not have these advantages. Finally, we validated our method through experiments on a real-world geological dataset. The analysis of experimental results by domain experts has confirmed the added value of the method.

Acknowledgments. The authors wish to thank Isabelle Rouet, geologist and expert in soil erosion, for providing the data and validating the results.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, VLDB, pages 487–499. Morgan Kaufmann, 1994.

[2] G. L. Andrienko and N. V. Andrienko. Knowledge-based visualization to support spatial data mining. In IDA, pages 149–160, 1999.

[3] A. Appice, M. Ceci, A. Lanza, F. A. Lisi, and D. Malerba. Discovery of spatial association rules in geo-referenced census data: A relational mining approach. Intell. Data Anal., 7(6):541–566, 2003.

[4] V. Bogorny, J. F. Valiati, S. da Silva Camargo, P. M. Engel, B. Kuijpers, and L. O. Alvares. Mining maximal generalized frequent geographic patterns with knowledge constraints. In ICDM, pages 813–817. IEEE Computer Society, 2006.

[5] C. Brunk, J. Kelly, and R. Kohavi. MineSet: An integrated system for data mining. In KDD, pages 135–138, 1997.


Figure 9: Visualization of colocations on soil erosion data (threshold: 0.1, distance: 300m)

[6] M. Ceci, A. Appice, and D. Malerba. Discovering emerging patterns in spatial databases: A multi-relational approach. In PKDD’07, volume 4702 of LNCS, pages 390–397. Springer, 2007.

[7] M. Celik, J. M. Kang, and S. Shekhar. Zonal co-location pattern discovery with dynamic parameters. In IEEE ICDM’07, pages 433–438. IEEE Computer Society, 2007.

[8] F. Flouvat, F. De Marchi, and J.-M. Petit. The izi project: easy prototyping of interesting pattern mining algorithms. In Advanced Techniques for Data Mining and Knowledge Discovery, LNCS, pages 1–15. Springer-Verlag, 2009.

[9] F. Flouvat, N. Selmaoui-Folcher, D. Gay, I. Rouet, and C. Grison. Constrained colocation mining: application to soil erosion characterization. In S. Y. Shin and S. Ossowski, editors, SAC. ACM, 2010.

[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA Data Mining Software: An Update, volume 11. 2009.

[11] J. Han and M. Kamber. Data Mining, Second Edition: Concepts and Techniques. Morgan Kaufmann, January 2006.

[12] Y. Huang, S. Shekhar, and H. Xiong. Discovering colocation patterns from spatial data sets: A general approach. IEEE Trans. Knowl. Data Eng., 16(12):1472–1485, 2004.

[13] D. A. Keim. Information visualization and visual data mining. IEEE Trans. Vis. Comput. Graph., 8(1):1–8, 2002.

[14] K. Koperski and J. Han. Discovery of spatial association rules in geographic information databases. In M. J. Egenhofer and J. R. Herring, editors, SSD, volume 951 of Lecture Notes in Computer Science, pages 47–66. Springer, 1995.

[15] C. K.-S. Leung, P. Irani, and C. L. Carmichael. Wifisviz: Effective visualization of frequent itemsets. In ICDM, pages 875–880. IEEE Computer Society, 2008.

[16] F. A. Lisi and D. Malerba. Inducing multi-level association rules from multiple relations. Machine Learning, 55(2):175–210, 2004.

[17] D. Malerba. A relational perspective on spatial data mining. International Journal of Data Mining, Modelling and Management, 1(1):103–118, 2008.

[18] H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Discov., 1(3):241–258, 1997.

[19] D. Pelleg and A. W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In P. Langley, editor, ICML, pages 727–734. Morgan Kaufmann, 2000.

[20] S. Shekhar and Y. Huang. Discovering spatial co-location patterns: A summary of results. In C. S. Jensen, M. Schneider, B. Seeger, and V. J. Tsotras, editors, SSTD, volume 2121 of Lecture Notes in Computer Science, pages 236–256. Springer, 2001.

[21] J. S. Yoo and S. Shekhar. A joinless approach for mining spatial colocation patterns. IEEE Trans. Knowl. Data Eng., 18(10):1323–1337, 2006.
